! pip install feature-engine
! pip install category_encoders
! pip install torch
Requirement already satisfied: feature-engine in c:\users\pando\anaconda3\lib\site-packages (1.6.2) Requirement already satisfied: numpy>=1.18.2 in c:\users\pando\anaconda3\lib\site-packages (from feature-engine) (1.24.3) Requirement already satisfied: pandas>=1.0.3 in c:\users\pando\anaconda3\lib\site-packages (from feature-engine) (2.1.4) Requirement already satisfied: scikit-learn>=1.0.0 in c:\users\pando\anaconda3\lib\site-packages (from feature-engine) (1.3.0) Requirement already satisfied: scipy>=1.4.1 in c:\users\pando\anaconda3\lib\site-packages (from feature-engine) (1.11.4) Requirement already satisfied: statsmodels>=0.11.1 in c:\users\pando\anaconda3\lib\site-packages (from feature-engine) (0.14.0) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\pando\anaconda3\lib\site-packages (from pandas>=1.0.3->feature-engine) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\pando\anaconda3\lib\site-packages (from pandas>=1.0.3->feature-engine) (2023.3.post1) Requirement already satisfied: tzdata>=2022.1 in c:\users\pando\anaconda3\lib\site-packages (from pandas>=1.0.3->feature-engine) (2023.3) Requirement already satisfied: joblib>=1.1.1 in c:\users\pando\anaconda3\lib\site-packages (from scikit-learn>=1.0.0->feature-engine) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\pando\anaconda3\lib\site-packages (from scikit-learn>=1.0.0->feature-engine) (2.2.0) Requirement already satisfied: patsy>=0.5.2 in c:\users\pando\anaconda3\lib\site-packages (from statsmodels>=0.11.1->feature-engine) (0.5.3) Requirement already satisfied: packaging>=21.3 in c:\users\pando\anaconda3\lib\site-packages (from statsmodels>=0.11.1->feature-engine) (23.1) Requirement already satisfied: six in c:\users\pando\anaconda3\lib\site-packages (from patsy>=0.5.2->statsmodels>=0.11.1->feature-engine) (1.16.0)
ERROR: Invalid requirement: '#'
Requirement already satisfied: torch in c:\users\pando\anaconda3\lib\site-packages (2.2.1) Requirement already satisfied: filelock in c:\users\pando\anaconda3\lib\site-packages (from torch) (3.9.0) Requirement already satisfied: typing-extensions>=4.8.0 in c:\users\pando\anaconda3\lib\site-packages (from torch) (4.10.0) Requirement already satisfied: sympy in c:\users\pando\anaconda3\lib\site-packages (from torch) (1.11.1) Requirement already satisfied: networkx in c:\users\pando\anaconda3\lib\site-packages (from torch) (3.1) Requirement already satisfied: jinja2 in c:\users\pando\anaconda3\lib\site-packages (from torch) (3.1.2) Requirement already satisfied: fsspec in c:\users\pando\anaconda3\lib\site-packages (from torch) (2023.4.0) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\pando\anaconda3\lib\site-packages (from jinja2->torch) (2.1.1) Requirement already satisfied: mpmath>=0.19 in c:\users\pando\anaconda3\lib\site-packages (from sympy->torch) (1.3.0)
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
from numpy import absolute, mean, std
import math
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import torch
import torch.nn as nn
import torch.optim as optim
from feature_engine.creation import CyclicalFeatures
import category_encoders as ce
from scipy.special import softmax
import plotly.io as pio
import plotly.offline as pyo
pio.renderers.default = 'iframe'
pyo.init_notebook_mode()
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()
data = pd.read_csv("elektronisk-rapportering-ers-2018-fangstmelding-dca-simple.csv", sep=";")
data.head()
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497177 | 01.01.2018 | 01.01.2018 | 00:00 | 31.12.2017 | 31.12.2017 | 00:00 | -60,35 | -46,133 | NaN | ... | Antarktisk krill | 506.0 | Antarktisk krill | 706714.0 | 5.0 | 28 m og over | 9432.0 | NaN | 19,87 | 133,88 |
| 1 | 1497178 | 01.01.2018 | 01.01.2018 | 00:00 | 30.12.2017 23:21 | 30.12.2017 | 23:21 | 74,885 | 16,048 | 20.0 | ... | Hyse | 202.0 | Hyse | 9594.0 | 5.0 | 28 m og over | 1476.0 | NaN | 12,6 | 56,8 |
| 2 | 1497178 | 01.01.2018 | 01.01.2018 | 00:00 | 30.12.2017 23:21 | 30.12.2017 | 23:21 | 74,885 | 16,048 | 20.0 | ... | Torsk | 201.0 | Torsk | 8510.0 | 5.0 | 28 m og over | 1476.0 | NaN | 12,6 | 56,8 |
| 3 | 1497178 | 01.01.2018 | 01.01.2018 | 00:00 | 30.12.2017 23:21 | 30.12.2017 | 23:21 | 74,885 | 16,048 | 20.0 | ... | Blåkveite | 301.0 | Blåkveite | 196.0 | 5.0 | 28 m og over | 1476.0 | NaN | 12,6 | 56,8 |
| 4 | 1497178 | 01.01.2018 | 01.01.2018 | 00:00 | 30.12.2017 23:21 | 30.12.2017 | 23:21 | 74,885 | 16,048 | 20.0 | ... | Sei | 203.0 | Sei | 134.0 | 5.0 | 28 m og over | 1476.0 | NaN | 12,6 | 56,8 |
5 rows × 45 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 305434 entries, 0 to 305433 Data columns (total 45 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Melding ID 305434 non-null int64 1 Meldingstidspunkt 305434 non-null object 2 Meldingsdato 305434 non-null object 3 Meldingsklokkeslett 305434 non-null object 4 Starttidspunkt 305434 non-null object 5 Startdato 305434 non-null object 6 Startklokkeslett 305434 non-null object 7 Startposisjon bredde 305434 non-null object 8 Startposisjon lengde 305434 non-null object 9 Hovedområde start (kode) 303433 non-null float64 10 Hovedområde start 301310 non-null object 11 Lokasjon start (kode) 303433 non-null float64 12 Havdybde start 305434 non-null int64 13 Stopptidspunkt 305434 non-null object 14 Stoppdato 305434 non-null object 15 Stoppklokkeslett 305434 non-null object 16 Varighet 305434 non-null int64 17 Fangstår 305434 non-null int64 18 Stopposisjon bredde 305434 non-null object 19 Stopposisjon lengde 305434 non-null object 20 Hovedområde stopp (kode) 303472 non-null float64 21 Hovedområde stopp 301310 non-null object 22 Lokasjon stopp (kode) 303472 non-null float64 23 Havdybde stopp 305434 non-null int64 24 Trekkavstand 305410 non-null float64 25 Redskap FAO (kode) 305434 non-null object 26 Redskap FAO 305246 non-null object 27 Redskap FDIR (kode) 305246 non-null float64 28 Redskap FDIR 305246 non-null object 29 Hovedart FAO (kode) 300456 non-null object 30 Hovedart FAO 300456 non-null object 31 Hovedart - FDIR (kode) 300456 non-null float64 32 Art FAO (kode) 300456 non-null object 33 Art FAO 300452 non-null object 34 Art - FDIR (kode) 300452 non-null float64 35 Art - FDIR 300452 non-null object 36 Art - gruppe (kode) 300452 non-null float64 37 Art - gruppe 300452 non-null object 38 Rundvekt 300456 non-null float64 39 Lengdegruppe (kode) 304750 non-null float64 40 Lengdegruppe 304750 non-null object 41 Bruttotonnasje 1969 234005 non-null float64 42 Bruttotonnasje annen 74774 non-null float64 43 Bredde 304750 non-null object 44 Fartøylengde 305434 non-null object dtypes: float64(13), int64(5), object(27) memory usage: 104.9+ MB
data.describe()
| Melding ID | Hovedområde start (kode) | Lokasjon start (kode) | Havdybde start | Varighet | Fangstår | Hovedområde stopp (kode) | Lokasjon stopp (kode) | Havdybde stopp | Trekkavstand | Redskap FDIR (kode) | Hovedart - FDIR (kode) | Art - FDIR (kode) | Art - gruppe (kode) | Rundvekt | Lengdegruppe (kode) | Bruttotonnasje 1969 | Bruttotonnasje annen | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.054340e+05 | 303433.000000 | 303433.000000 | 305434.000000 | 305434.000000 | 305434.000000 | 303472.000000 | 303472.000000 | 305434.000000 | 3.054100e+05 | 305246.000000 | 300456.000000 | 300452.000000 | 300452.000000 | 3.004560e+05 | 304750.000000 | 234005.000000 | 74774.000000 |
| mean | 1.658783e+06 | 14.463737 | 19.074712 | -228.025292 | 537.095526 | 2017.999941 | 14.430415 | 18.883353 | -229.084850 | 1.566397e+04 | 46.489746 | 1326.729934 | 1414.625914 | 259.746585 | 7.438208e+03 | 4.575032 | 1408.386975 | 186.172573 |
| std | 9.130738e+04 | 13.001244 | 18.469340 | 226.062493 | 2201.624688 | 0.007677 | 12.973150 | 18.361244 | 224.277365 | 9.033085e+04 | 13.534202 | 614.506560 | 633.188386 | 320.124913 | 4.281086e+04 | 0.692769 | 1148.384145 | 165.761157 |
| min | 1.497177e+06 | 0.000000 | 0.000000 | -5388.000000 | 0.000000 | 2017.000000 | 0.000000 | 0.000000 | -5388.000000 | 0.000000e+00 | 11.000000 | 412.000000 | 211.000000 | 101.000000 | 0.000000e+00 | 3.000000 | 104.000000 | 21.000000 |
| 25% | 1.567228e+06 | 5.000000 | 7.000000 | -273.000000 | 123.000000 | 2018.000000 | 5.000000 | 7.000000 | -274.000000 | 2.533000e+03 | 32.000000 | 1022.000000 | 1022.000000 | 201.000000 | 6.400000e+01 | 4.000000 | 496.000000 | 87.000000 |
| 50% | 1.674230e+06 | 8.000000 | 12.000000 | -196.000000 | 296.000000 | 2018.000000 | 8.000000 | 12.000000 | -198.000000 | 7.598000e+03 | 51.000000 | 1032.000000 | 1032.000000 | 203.000000 | 3.000000e+02 | 5.000000 | 1184.000000 | 149.000000 |
| 75% | 1.735590e+06 | 20.000000 | 24.000000 | -128.000000 | 494.000000 | 2018.000000 | 20.000000 | 24.000000 | -127.000000 | 2.259900e+04 | 55.000000 | 1038.000000 | 2202.000000 | 302.000000 | 2.236000e+03 | 5.000000 | 2053.000000 | 236.000000 |
| max | 1.800291e+06 | 81.000000 | 87.000000 | 1220.000000 | 125534.000000 | 2018.000000 | 81.000000 | 87.000000 | 1616.000000 | 1.588863e+07 | 80.000000 | 6619.000000 | 6619.000000 | 9903.000000 | 1.100000e+06 | 5.000000 | 9432.000000 | 1147.000000 |
data.iloc[100]
Melding ID 1497342 Meldingstidspunkt 01.01.2018 23:30 Meldingsdato 01.01.2018 Meldingsklokkeslett 23:30 Starttidspunkt 01.01.2018 07:58 Startdato 01.01.2018 Startklokkeslett 07:58 Startposisjon bredde 71,262 Startposisjon lengde 25,188 Hovedområde start (kode) 4.0 Hovedområde start Vest-Finnmark Lokasjon start (kode) 26.0 Havdybde start -289 Stopptidspunkt 01.01.2018 14:04 Stoppdato 01.01.2018 Stoppklokkeslett 14:04 Varighet 366 Fangstår 2018 Stopposisjon bredde 71,317 Stopposisjon lengde 25,225 Hovedområde stopp (kode) 4.0 Hovedområde stopp Vest-Finnmark Lokasjon stopp (kode) 26.0 Havdybde stopp -294 Trekkavstand 6278.0 Redskap FAO (kode) OTB Redskap FAO Bunntrål, otter Redskap FDIR (kode) 51.0 Redskap FDIR Bunntrål Hovedart FAO (kode) COD Hovedart FAO Torsk Hovedart - FDIR (kode) 1022.0 Art FAO (kode) HAD Art FAO Hyse Art - FDIR (kode) 1027.0 Art - FDIR Hyse Art - gruppe (kode) 202.0 Art - gruppe Hyse Rundvekt 580.0 Lengdegruppe (kode) 5.0 Lengdegruppe 28 m og over Bruttotonnasje 1969 691.0 Bruttotonnasje annen NaN Bredde 10,5 Fartøylengde 39,79 Name: 100, dtype: object
#Converting so that every "," goes to "."
coloumns_to_convert = ["Startposisjon bredde", "Startposisjon lengde", "Stopposisjon bredde", "Stopposisjon lengde", "Bredde", "Fartøylengde"]
data[coloumns_to_convert] = data[coloumns_to_convert].replace({',': '.'}, regex=True)
data.iloc[100]
Melding ID 1497342 Meldingstidspunkt 01.01.2018 23:30 Meldingsdato 01.01.2018 Meldingsklokkeslett 23:30 Starttidspunkt 01.01.2018 07:58 Startdato 01.01.2018 Startklokkeslett 07:58 Startposisjon bredde 71.262 Startposisjon lengde 25.188 Hovedområde start (kode) 4.0 Hovedområde start Vest-Finnmark Lokasjon start (kode) 26.0 Havdybde start -289 Stopptidspunkt 01.01.2018 14:04 Stoppdato 01.01.2018 Stoppklokkeslett 14:04 Varighet 366 Fangstår 2018 Stopposisjon bredde 71.317 Stopposisjon lengde 25.225 Hovedområde stopp (kode) 4.0 Hovedområde stopp Vest-Finnmark Lokasjon stopp (kode) 26.0 Havdybde stopp -294 Trekkavstand 6278.0 Redskap FAO (kode) OTB Redskap FAO Bunntrål, otter Redskap FDIR (kode) 51.0 Redskap FDIR Bunntrål Hovedart FAO (kode) COD Hovedart FAO Torsk Hovedart - FDIR (kode) 1022.0 Art FAO (kode) HAD Art FAO Hyse Art - FDIR (kode) 1027.0 Art - FDIR Hyse Art - gruppe (kode) 202.0 Art - gruppe Hyse Rundvekt 580.0 Lengdegruppe (kode) 5.0 Lengdegruppe 28 m og over Bruttotonnasje 1969 691.0 Bruttotonnasje annen NaN Bredde 10.5 Fartøylengde 39.79 Name: 100, dtype: object
# And we want to convert to floats
data[coloumns_to_convert] = data[coloumns_to_convert].astype(float)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 305434 entries, 0 to 305433 Data columns (total 45 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Melding ID 305434 non-null int64 1 Meldingstidspunkt 305434 non-null object 2 Meldingsdato 305434 non-null object 3 Meldingsklokkeslett 305434 non-null object 4 Starttidspunkt 305434 non-null object 5 Startdato 305434 non-null object 6 Startklokkeslett 305434 non-null object 7 Startposisjon bredde 305434 non-null float64 8 Startposisjon lengde 305434 non-null float64 9 Hovedområde start (kode) 303433 non-null float64 10 Hovedområde start 301310 non-null object 11 Lokasjon start (kode) 303433 non-null float64 12 Havdybde start 305434 non-null int64 13 Stopptidspunkt 305434 non-null object 14 Stoppdato 305434 non-null object 15 Stoppklokkeslett 305434 non-null object 16 Varighet 305434 non-null int64 17 Fangstår 305434 non-null int64 18 Stopposisjon bredde 305434 non-null float64 19 Stopposisjon lengde 305434 non-null float64 20 Hovedområde stopp (kode) 303472 non-null float64 21 Hovedområde stopp 301310 non-null object 22 Lokasjon stopp (kode) 303472 non-null float64 23 Havdybde stopp 305434 non-null int64 24 Trekkavstand 305410 non-null float64 25 Redskap FAO (kode) 305434 non-null object 26 Redskap FAO 305246 non-null object 27 Redskap FDIR (kode) 305246 non-null float64 28 Redskap FDIR 305246 non-null object 29 Hovedart FAO (kode) 300456 non-null object 30 Hovedart FAO 300456 non-null object 31 Hovedart - FDIR (kode) 300456 non-null float64 32 Art FAO (kode) 300456 non-null object 33 Art FAO 300452 non-null object 34 Art - FDIR (kode) 300452 non-null float64 35 Art - FDIR 300452 non-null object 36 Art - gruppe (kode) 300452 non-null float64 37 Art - gruppe 300452 non-null object 38 Rundvekt 300456 non-null float64 39 Lengdegruppe (kode) 304750 non-null float64 40 Lengdegruppe 304750 non-null object 41 Bruttotonnasje 1969 234005 non-null float64 42 Bruttotonnasje annen 74774 non-null float64 43 Bredde 304750 non-null float64 44 Fartøylengde 305434 non-null float64 dtypes: float64(19), int64(5), object(21) memory usage: 104.9+ MB
We see now that out "Startposisjon bredde" and more, are now floats and not objects anymore, this will be easier to work with later
We want to replace some values that are not of significant for us, this by some createria:
data.isnull().sum()
Melding ID 0 Meldingstidspunkt 0 Meldingsdato 0 Meldingsklokkeslett 0 Starttidspunkt 0 Startdato 0 Startklokkeslett 0 Startposisjon bredde 0 Startposisjon lengde 0 Hovedområde start (kode) 2001 Hovedområde start 4124 Lokasjon start (kode) 2001 Havdybde start 0 Stopptidspunkt 0 Stoppdato 0 Stoppklokkeslett 0 Varighet 0 Fangstår 0 Stopposisjon bredde 0 Stopposisjon lengde 0 Hovedområde stopp (kode) 1962 Hovedområde stopp 4124 Lokasjon stopp (kode) 1962 Havdybde stopp 0 Trekkavstand 24 Redskap FAO (kode) 0 Redskap FAO 188 Redskap FDIR (kode) 188 Redskap FDIR 188 Hovedart FAO (kode) 4978 Hovedart FAO 4978 Hovedart - FDIR (kode) 4978 Art FAO (kode) 4978 Art FAO 4982 Art - FDIR (kode) 4982 Art - FDIR 4982 Art - gruppe (kode) 4982 Art - gruppe 4982 Rundvekt 4978 Lengdegruppe (kode) 684 Lengdegruppe 684 Bruttotonnasje 1969 71429 Bruttotonnasje annen 230660 Bredde 684 Fartøylengde 0 dtype: int64
columns_to_check = ['Rundvekt', 'Art FAO', 'Bredde', 'Art - FDIR']
# if any of these values have NaN-values we want to remove them, since we need them for our prediciton later
data.dropna(subset=columns_to_check, how='any', inplace=True)
data.isnull().sum()
Melding ID 0 Meldingstidspunkt 0 Meldingsdato 0 Meldingsklokkeslett 0 Starttidspunkt 0 Startdato 0 Startklokkeslett 0 Startposisjon bredde 0 Startposisjon lengde 0 Hovedområde start (kode) 1786 Hovedområde start 3760 Lokasjon start (kode) 1786 Havdybde start 0 Stopptidspunkt 0 Stoppdato 0 Stoppklokkeslett 0 Varighet 0 Fangstår 0 Stopposisjon bredde 0 Stopposisjon lengde 0 Hovedområde stopp (kode) 1760 Hovedområde stopp 3760 Lokasjon stopp (kode) 1760 Havdybde stopp 0 Trekkavstand 19 Redskap FAO (kode) 0 Redskap FAO 187 Redskap FDIR (kode) 187 Redskap FDIR 187 Hovedart FAO (kode) 0 Hovedart FAO 0 Hovedart - FDIR (kode) 0 Art FAO (kode) 0 Art FAO 0 Art - FDIR (kode) 0 Art - FDIR 0 Art - gruppe (kode) 0 Art - gruppe 0 Rundvekt 0 Lengdegruppe (kode) 0 Lengdegruppe 0 Bruttotonnasje 1969 69709 Bruttotonnasje annen 226267 Bredde 0 Fartøylengde 0 dtype: int64
Now that all the Art (species) values have no null values, we can look at the smaller specifics:
columns_to_check_location = ['Hovedområde start', 'Hovedområde stopp', 'Redskap FAO', 'Redskap FDIR']
data.dropna(subset=columns_to_check_location, how='any', inplace=True)
# Again dropping rows that have NaN values, since we need some of these features for our prediction later.
data.isnull().sum()
Melding ID 0 Meldingstidspunkt 0 Meldingsdato 0 Meldingsklokkeslett 0 Starttidspunkt 0 Startdato 0 Startklokkeslett 0 Startposisjon bredde 0 Startposisjon lengde 0 Hovedområde start (kode) 0 Hovedområde start 0 Lokasjon start (kode) 0 Havdybde start 0 Stopptidspunkt 0 Stoppdato 0 Stoppklokkeslett 0 Varighet 0 Fangstår 0 Stopposisjon bredde 0 Stopposisjon lengde 0 Hovedområde stopp (kode) 183 Hovedområde stopp 0 Lokasjon stopp (kode) 183 Havdybde stopp 0 Trekkavstand 19 Redskap FAO (kode) 0 Redskap FAO 0 Redskap FDIR (kode) 0 Redskap FDIR 0 Hovedart FAO (kode) 0 Hovedart FAO 0 Hovedart - FDIR (kode) 0 Art FAO (kode) 0 Art FAO 0 Art - FDIR (kode) 0 Art - FDIR 0 Art - gruppe (kode) 0 Art - gruppe 0 Rundvekt 0 Lengdegruppe (kode) 0 Lengdegruppe 0 Bruttotonnasje 1969 68636 Bruttotonnasje annen 223486 Bredde 0 Fartøylengde 0 dtype: int64
We know that we will remove the columns that have "(kode)" in them, since they dont help us with any significance, but we will do this later.
We will also drop the "Bruttotonnasje 1969" and "annen" since, they have so many missing values, and since we also dont need them for later, we will just remove them.
What do we then need?
# there are stwo days of data from dec. 2017, we choose to remove them for simplicity and since they dont actually represent the data (which is in 2018).
data['Startdato'] = pd.to_datetime(data['Startdato'], format='%d.%m.%Y')
data = data[data['Startdato'].dt.year != 2017]
data
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 1497229 | 01.01.2018 15:49 | 01.01.2018 | 15:49 | 01.01.2018 10:01 | 2018-01-01 | 10:01 | 67.828 | 12.972 | 5.0 | ... | Hyse | 202.0 | Hyse | 4.0 | 3.0 | 15-20,99 m | NaN | 51.0 | 5.06 | 19.10 |
| 20 | 1497229 | 01.01.2018 15:49 | 01.01.2018 | 15:49 | 01.01.2018 13:07 | 2018-01-01 | 13:07 | 67.826 | 12.967 | 5.0 | ... | Torsk | 201.0 | Torsk | 1800.0 | 3.0 | 15-20,99 m | NaN | 51.0 | 5.06 | 19.10 |
| 21 | 1497229 | 01.01.2018 15:49 | 01.01.2018 | 15:49 | 01.01.2018 13:07 | 2018-01-01 | 13:07 | 67.826 | 12.967 | 5.0 | ... | Rødspette | 320.0 | Annen flatfisk, bunnfisk og dypvannsfisk | 50.0 | 3.0 | 15-20,99 m | NaN | 51.0 | 5.06 | 19.10 |
| 22 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 01:19 | 2018-01-01 | 01:19 | 74.811 | 36.665 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.20 | 49.95 |
| 23 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 03:04 | 2018-01-01 | 03:04 | 74.835 | 36.744 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.20 | 49.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 305429 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Gråsteinbit | 304.0 | Steinbiter | 145.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.60 | 57.30 |
| 305430 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Uer (vanlig) | 302.0 | Uer | 136.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.60 | 57.30 |
| 305431 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Flekksteinbit | 304.0 | Steinbiter | 132.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.60 | 57.30 |
| 305432 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Snabeluer | 302.0 | Uer | 102.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.60 | 57.30 |
| 305433 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Blåkveite | 301.0 | Blåkveite | 63.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.60 | 57.30 |
295734 rows × 45 columns
We are trying to predict what the most common species are, from different data like their geographical location (lat/log) and other data, we are then going to predict a list with values from 0 - 1 (0% - 100%) that sum up to 1 / 100%, to then represent what the chances are for these species in the list to be the most common. This could also show intuitivly what species most likely are here, in this location by these other variables and so on, showing a representation of what these species are by chance.
We want to split our data so that data from each month is getting to our training data and test data, this will make the prediction later easier.
# get unique months, 1 - 12 -> 2018.
unique_months = data['Startdato'].dt.month.unique()
unique_months = sorted(unique_months)
# group messageID's together, we dont want to split up inside the groups themselves
groups = data.groupby('Melding ID')
melding_id_month_map = {}
# create a map for all months, and their corresponding groups that belong in them!
for month in unique_months:
melding_ids = []
for name, group in groups:
group_month = group['Startdato'].dt.month
if month in group_month.unique():
melding_ids.append(name)
melding_id_month_map[month] = melding_ids
#print(melding_id_month_map) - see if all months have values, beware: alot of values and runtime is slow!
train_melding_ids_by_month = {}
test_melding_ids_by_month = {}
#splitting the data inside each month
for month, melding_ids in melding_id_month_map.items():
train_melding_ids, test_melding_ids = train_test_split(melding_ids, test_size=0.2, random_state=42)
#storing each melding ID's to either training or test, for later extraction.
train_melding_ids_by_month[month] = train_melding_ids
test_melding_ids_by_month[month] = test_melding_ids
# start off with empty dataframes
train_data = pd.DataFrame()
test_data = pd.DataFrame()
# add all training data and all test data to their dataframes.
for month, train_melding_ids in train_melding_ids_by_month.items():
train_month_data = data[data['Melding ID'].isin(train_melding_ids)]
train_data = pd.concat([train_data, train_month_data])
for month, test_melding_ids in test_melding_ids_by_month.items():
test_month_data = data[data['Melding ID'].isin(test_melding_ids)]
test_data = pd.concat([test_data, test_month_data])
# reset index for both dataframes to make sure their indexes are correct
train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)
train_data
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 01:19 | 2018-01-01 | 01:19 | 74.811 | 36.665 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 1 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 03:04 | 2018-01-01 | 03:04 | 74.835 | 36.744 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 2 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.865 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 3 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.866 | 15.0 | ... | Snøkrabbe | 501.0 | Snøkrabbe | 220.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 4 | 1497288 | 01.01.2018 21:02 | 01.01.2018 | 21:02 | 01.01.2018 05:47 | 2018-01-01 | 05:47 | 69.744 | 16.516 | 5.0 | ... | Sei | 203.0 | Sei | 2895.0 | 4.0 | 21-27,99 m | NaN | 354.0 | 9.0 | 27.49 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239798 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Gråsteinbit | 304.0 | Steinbiter | 145.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239799 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Uer (vanlig) | 302.0 | Uer | 136.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239800 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Flekksteinbit | 304.0 | Steinbiter | 132.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239801 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Snabeluer | 302.0 | Uer | 102.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239802 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Blåkveite | 301.0 | Blåkveite | 63.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
239803 rows × 45 columns
Note, look at their amount of rows, the rest will be our test set's amount
We are just going to group them after their Message ID, since one trip can have (and in most cases has) multiple and the same Message ID, we can group them so that each trip will (hopefully) only be one row, but we will make sure of this later (*see "classifying species
grouped_data_train = train_data.groupby('Melding ID')
We want to just group them for now (after their messageID, since we want all the data from one trip / expedition) since we want to find out the most common species, and visualize this later.
type(grouped_data_train)
pandas.core.groupby.generic.DataFrameGroupBy
grouped_data_train.groups
{1497249: [0, 1, 2, 3], 1497288: [4, 5, 6, 7, 8, 9], 1497306: [10, 11, 12, 13, 14], 1497310: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 1497311: [28, 29, 30], 1497312: [31], 1497321: [32, 33], 1497323: [34, 35, 36, 37, 38], 1497326: [39, 40, 41, 42, 43, 44], 1497330: [45, 46, 47, 48, 49, 50, 51], 1497332: [52, 53, 54, 55, 56], 1497341: [57, 58, 59], 1497344: [60, 61], 1497350: [62, 63], 1497352: [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75], 1497353: [76, 77, 78], 1497354: [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89], 1497358: [90, 91], 1497362: [92, 93, 94, 95], 1497364: [96, 97, 98, 99, 100, 101], 1497368: [102, 103, 104, 105, 106, 107, 108], 1497384: [109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119], 1497414: [120, 121, 122, 123], 1497421: [124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 1497432: [142, 143, 144, 145, 146, 147, 148], 1497433: [149, 150], 1497435: [151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167], 1497444: [168, 169, 170, 171], 1497448: [172, 173, 174, 175, 176, 177], 1497475: [178, 179, 180, 181], 1497482: [182, 183, 184, 185, 186], 1497484: [187, 188, 189], 1497495: [190, 191, 192, 193], 1497505: [194, 195, 196], 1497537: [197, 198, 199, 200, 201], 1497550: [202, 203, 204], 1497555: [205, 206, 207, 208, 209], 1497556: [210, 211, 212, 213], 1497559: [214, 215, 216, 217, 218, 219], 1497589: [220], 1497600: [221], 1497621: [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233], 1497628: [234, 235, 236, 237, 238, 239, 240, 241, 242, 243], 1497631: [244, 245], 1497659: [246, 247, 248, 249], 1497681: [250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261], 1497686: [262, 263, 264], 1497692: [265, 266, 267, 268, 269], 1497719: [270, 271, 272, 273, 274, 275, 276], 1497720: [277, 278, 279, 280, 281, 282, 283], 1497758: [284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298], 1497776: [299, 300, 301, 302, 303, 304], 1497784: [305, 306, 307, 308, 309, 310, 311, 312, 313, 314], 1497789: [315, 316, 317, 318, 319, 320, 321], 1497801: [322, 323, 324], 1497803: [325, 326, 327, 328, 329, 330, 331], 1497805: [332, 333, 334], 1497812: [335, 336, 337], 1497816: [338, 339, 340, 341, 342, 343, 344], 1497818: [345, 346, 347, 348, 349], 1497823: [350, 351, 352, 353, 354, 355], 1497824: [356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374], 1497827: [375, 376, 377, 378, 379], 1497833: [380, 381, 382, 383, 384, 385, 386], 1497836: [387], 1497838: [388], 1497839: [389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401], 1497843: [402, 403, 404], 1497846: [405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415], 1497848: [416, 417, 418], 1497849: [419, 420, 421, 422, 423], 1497852: [424, 425, 426], 1497856: [427, 428, 429, 430, 431], 1497860: [432, 433, 434, 435, 436, 437, 438, 439, 440, 441], 1497863: [442, 443, 444, 445, 446, 447, 448, 449, 450, 451], 1497869: [452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464], 1497873: [465], 1497875: [466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476], 1497904: [477, 478], 1497907: [479, 480], 1497918: [481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493], 1497923: [494, 495, 496, 497, 498, 499], 1497924: [500, 501, 502, 503, 504, 505], 1497925: [506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516], 1497938: [517, 518, 519, 520], 1497941: [521, 522], 1497943: [523, 524, 525, 526], 1497944: [527, 528, 529, 530, 531, 532], 1497962: [533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547], 1497965: [548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571], 1497968: [572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596], 1497974: [597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625], 1497976: [626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639], 1497983: [640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658], 1497991: [659, 660, 661, 662], 1497994: [663, 664, 665, 666], 1498002: [667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686], 1498041: [687, 688, 689], 1498045: [690, 691, 692, 693, 694, 695, 696], 1498050: [697, 698, 699, 700, 701, 702, 703], ...}
We want to only predict some species, the most common ones, but the most common here, we define as the ones that have been seen the most in the groups themselves, so counted up the most, not necessary the most caught in terms of weight, same logic applies to our tools (see below)
species_counts_train = {}
tools_counts_train = {}
for group_name, group_data in grouped_data_train:
# Counting up all the tools and Species, how much they have been "seen"
species_counts_group = group_data['Art - FDIR'].value_counts()
tools_counts_group = group_data['Redskap FDIR'].value_counts()
for species, count in species_counts_group.items():
species_counts_train[species] = species_counts_train.get(species, 0) + count
for tool, count in tools_counts_group.items():
tools_counts_train[tool] = tools_counts_train.get(tool, 0) + count
# Converting to dataframes to easily work with later
species_counts_train_df = pd.DataFrame(list(species_counts_train.items()), columns=['Species', 'Total_Count'])
tools_counts_train_df = pd.DataFrame(list(tools_counts_train.items()), columns=['Tool', 'Total_Count'])
# sorting them, so that we get the most counted at the top
sorted_species_counts = species_counts_train_df.sort_values(by='Total_Count', ascending=False).reset_index(drop=True)
sorted_tools_counts = tools_counts_train_df.sort_values(by='Total_Count', ascending=False).reset_index(drop=True)
(See above) We are doing this to keep track of the most common species, which we want to visualize (see under)
sorted_species_counts
| Species | Total_Count | |
|---|---|---|
| 0 | Torsk | 45214 |
| 1 | Sei | 34207 |
| 2 | Hyse | 31263 |
| 3 | Lange | 13977 |
| 4 | Uer (vanlig) | 11443 |
| ... | ... | ... |
| 109 | Annen vanlig ti-armet blekksprut | 1 |
| 110 | Rundskate | 1 |
| 111 | Ansjos | 1 |
| 112 | Rød kråkebolle | 1 |
| 113 | Bukstripet pelamide | 1 |
114 rows × 2 columns
Same with tools (see under)
sorted_tools_counts
| Tool | Total_Count | |
|---|---|---|
| 0 | Bunntrål | 98313 |
| 1 | Snurrevad | 40068 |
| 2 | Andre liner | 35029 |
| 3 | Reketrål | 17891 |
| 4 | Udefinert garn | 15160 |
| 5 | Udefinert trål | 12319 |
| 6 | Snurpenot/ringnot | 7751 |
| 7 | Teiner | 5075 |
| 8 | Bunntrål par | 2455 |
| 9 | Dobbeltrål | 2330 |
| 10 | Flytetrål | 1493 |
| 11 | Flytetrål par | 1127 |
| 12 | Settegarn | 525 |
| 13 | Harpun og lignende uspesifiserte typer | 238 |
| 14 | Juksa/pilk | 17 |
| 15 | Dorg/harp/snik | 12 |
We only want to use a portion of both the tools and species, that can be defined here, see under.
most_common_range = 10
sorted_species_counts
| Species | Total_Count | |
|---|---|---|
| 0 | Torsk | 45214 |
| 1 | Sei | 34207 |
| 2 | Hyse | 31263 |
| 3 | Lange | 13977 |
| 4 | Uer (vanlig) | 11443 |
| ... | ... | ... |
| 109 | Annen vanlig ti-armet blekksprut | 1 |
| 110 | Rundskate | 1 |
| 111 | Ansjos | 1 |
| 112 | Rød kråkebolle | 1 |
| 113 | Bukstripet pelamide | 1 |
114 rows × 2 columns
most_common_species = sorted_species_counts[:most_common_range]
other_most_common_species = sorted_species_counts[most_common_range:]
We use Seaborn's barplots, see their official documentation for more: https://seaborn.pydata.org/generated/seaborn.barplot.html
plt.figure(figsize=(12, 6))
ax = sns.barplot(x='Species', y='Total_Count', data=most_common_species)
ax.bar_label(ax.containers[0], fontsize=10); # their counts.
plt.xlabel('Species')
plt.ylabel('Count')
plt.title(f'The {most_common_range} Most Common Species')
plt.xticks(rotation=15, ha='right')
plt.show()
set(most_common_species["Species"])
{'Blåkveite',
'Breiflabb',
'Brosme',
'Dypvannsreke',
'Hyse',
'Lange',
'Lysing',
'Sei',
'Torsk',
'Uer (vanlig)'}
sorted_tools_counts
| Tool | Total_Count | |
|---|---|---|
| 0 | Bunntrål | 98313 |
| 1 | Snurrevad | 40068 |
| 2 | Andre liner | 35029 |
| 3 | Reketrål | 17891 |
| 4 | Udefinert garn | 15160 |
| 5 | Udefinert trål | 12319 |
| 6 | Snurpenot/ringnot | 7751 |
| 7 | Teiner | 5075 |
| 8 | Bunntrål par | 2455 |
| 9 | Dobbeltrål | 2330 |
| 10 | Flytetrål | 1493 |
| 11 | Flytetrål par | 1127 |
| 12 | Settegarn | 525 |
| 13 | Harpun og lignende uspesifiserte typer | 238 |
| 14 | Juksa/pilk | 17 |
| 15 | Dorg/harp/snik | 12 |
most_common_tools = sorted_tools_counts[:most_common_range]
other_most_common_tools = sorted_tools_counts[most_common_range:]
plt.figure(figsize=(12, 6))
ax = sns.barplot(x='Tool', y='Total_Count', data=most_common_tools)
ax.bar_label(ax.containers[0], fontsize=10); # their counts.
plt.xlabel('Tool')
plt.ylabel('Count')
plt.title(f'The {most_common_range} Most Common Tools')
plt.xticks(rotation=15, ha='right')
plt.show()
import random
#Visualize a specific group, as an example, just to see the "normal" data in a group
group_keys = list(grouped_data_train.groups.keys())
random_group_key = random.choice(group_keys)
specific_group = grouped_data_train.get_group(random_group_key)
specific_group
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65832 | 1584854 | 01.04.2018 23:05 | 01.04.2018 | 23:05 | 01.04.2018 23:03 | 2018-04-01 | 23:03 | 67.669 | 12.247 | 5.0 | ... | Torsk | 201.0 | Torsk | 17580.0 | 5.0 | 28 m og over | 902.0 | NaN | 7.4 | 39.9 |
| 65833 | 1584854 | 01.04.2018 23:05 | 01.04.2018 | 23:05 | 01.04.2018 23:03 | 2018-04-01 | 23:03 | 67.669 | 12.247 | 5.0 | ... | Lange | 220.0 | Annen torskefisk | 67.0 | 5.0 | 28 m og over | 902.0 | NaN | 7.4 | 39.9 |
2 rows × 45 columns
From here, we see that each catch, has the same time, showing off the different species, and that when they have another catch they again have a different time, we want to save these, but as less rows, we will come back to this later.
unique_startdates_counts = []
for name, group in grouped_data_train:
unique_dates_count = len(group['Startdato'].unique())
if unique_dates_count < 2:
# ignore
break
elif unique_dates_count >= 2:
unique_startdates_counts.append((name, unique_dates_count))
unique_startdates_counts
[]
We are making sure that there isnt a real big difference in the days inside each group, only one group has 2 different days(which we know are a day where its late, and it goes over to the next day, so this wont make a big difference), now we can group them togehter looking after their time, if the time is different inside the group we can make a new row, meaning that if they have the same time all the species have been caught at the same time, if not, then they have been caught at different times:
We are going to classify our species, so that only the most common species/tools have their own names and the rest will be other
grouped_data_train["Hovedart FAO"].head()
0 Snøkrabbe
1 Snøkrabbe
2 Snøkrabbe
3 Snøkrabbe
4 Sei
...
239776 Hyse
239777 Hyse
239778 Hyse
239779 Hyse
239780 Hyse
Name: Hovedart FAO, Length: 145027, dtype: object
type(grouped_data_train["Hovedart FAO"])
pandas.core.groupby.generic.SeriesGroupBy
most_common_species["Species"]
0 Torsk 1 Sei 2 Hyse 3 Lange 4 Uer (vanlig) 5 Dypvannsreke 6 Brosme 7 Lysing 8 Breiflabb 9 Blåkveite Name: Species, dtype: object
Now we also need our grouped test data, since these are universal changes that need to be applied:
grouped_data_test = test_data.groupby('Melding ID')
grouped_data_test.groups
{1497229: [0, 1, 2], 1497314: [3, 4, 5, 6], 1497342: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 1497351: [20, 21], 1497377: [22, 23], 1497383: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], 1497422: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52], 1497423: [53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], 1497424: [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79], 1497426: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92], 1497514: [93], 1497531: [94, 95, 96, 97, 98, 99], 1497562: [100, 101, 102], 1497581: [103, 104, 105, 106, 107, 108, 109], 1497634: [110, 111, 112, 113, 114, 115], 1497685: [116, 117, 118, 119, 120, 121], 1497687: [122, 123, 124, 125], 1497743: [126, 127, 128, 129], 1497779: [130, 131], 1497809: [132, 133, 134, 135, 136], 1497820: [137, 138, 139, 140, 141], 1497841: [142, 143, 144, 145, 146, 147], 1497850: [148, 149, 150], 1497857: [151, 152, 153, 154, 155, 156, 157], 1497883: [158, 159], 1497933: [160, 161, 162, 163, 164, 165, 166, 167, 168], 1497985: [169, 170, 171, 172, 173], 1498026: [174, 175, 176, 177, 178], 1498158: [179, 180, 181, 182, 183, 184], 1498204: [185, 186, 187, 188, 189], 1498252: [190, 191, 192, 193, 194], 1498317: [195, 196, 197, 198, 199, 200], 1498341: [201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212], 1498353: [213, 214, 215, 216, 217], 1498382: [218, 219, 220], 1498403: [221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234], 1498438: [235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 1498442: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257], 1498448: [258, 259, 260, 261], 1498452: [262, 263, 264, 265, 266, 267, 268], 1498461: [269, 270, 271, 272, 273, 274, 275, 276], 1498463: [277, 278, 279, 280, 281, 282, 283, 284], 1498483: [285, 286, 287, 288, 289, 290, 291], 1498484: [292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 1498494: [309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320], 1498520: [321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339], 1498525: [340], 1498543: [341, 342, 343, 344], 1498622: [345, 346], 1498708: [347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358], 1498736: [359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371], 1498777: [372, 373], 1498790: [374, 375, 376, 377], 1498811: [378, 379, 380, 381, 382, 383, 384], 1498856: [385, 386, 387, 388], 1498892: [389, 390, 391], 1498893: [392, 393, 394, 395], 1498894: [396, 397, 398], 1498918: [399, 400, 401, 402], 1498927: [403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422], 1498933: [423, 424, 425, 426, 427], 1498937: [428, 429, 430, 431, 432, 433], 1498940: [434, 435, 436, 437, 438], 1498943: [439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464], 1498949: [465, 466, 467, 468, 469, 470, 471], 1498955: [472, 473, 474, 475, 476, 477], 1498965: [478, 479, 480], 1498969: [481], 1498975: [482, 483, 484, 485], 1498990: [486], 1498995: [487, 488, 489, 490, 491, 492, 493, 494, 495], 1499009: [496], 1499023: [497, 498, 499, 500, 501, 502, 503], 1499029: [504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518], 1499046: [519, 520, 521, 522, 523, 524, 525, 526, 527, 528], 1499049: [529], 1499066: [530, 531, 532, 533, 534, 535, 536, 537], 1499070: [538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558], 1499119: [559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583], 1499158: [584, 585, 586, 587, 588, 589], 1499173: [590, 591, 592, 593, 594, 595], 1499214: [596, 597], 1499221: [598, 599, 600, 601, 602, 603, 604], 1499239: [605], 1499252: [606, 607, 608, 609], 1499280: [610, 611], 1499312: [612, 613], 1499432: [614, 615, 616, 617, 618, 619, 620, 621, 622], 1499436: [623, 624, 625, 626], 1499470: [627, 628, 629, 630], 1499481: [631], 1499488: [632, 633, 634, 635, 636], 1499516: [637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651], 1499521: [652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664], 1499578: [665, 666, 667, 668, 669], 1499593: [670, 671, 672, 673], 1499661: [674, 675, 676, 677, 678, 679], 1499681: [680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701], 1499687: [702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720], 1499694: [721, 722, 723, 724], ...}
# beware! This code takes a while... (up to a few min.)
most_common_species_set = set(most_common_species["Species"])
most_common_tools_set = set(most_common_tools["Tool"])
def update_species_classification(group):
group["Art - FDIR"] = group["Art - FDIR"].apply(lambda x: x if x in most_common_species_set else 'Other')
group["Hovedart FAO"] = group["Hovedart FAO"].apply(lambda x: x if x in most_common_species_set else 'Other')
group["Redskap FDIR"] = group["Redskap FDIR"].apply(lambda x: x if x in most_common_tools_set else 'Other')
# if it's in common species/tools we keep it, else set it to "Other"
return group
# Apply to both training and test data
updated_group_data_train = grouped_data_train.apply(update_species_classification).reset_index(drop=True)
updated_group_data_test = grouped_data_test.apply(update_species_classification).reset_index(drop=True)
updated_group_data_train #We see a few instanses of "other". But we see that we again group our data, but still maintain our dataframe object!
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 01:19 | 2018-01-01 | 01:19 | 74.811 | 36.665 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 1 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 03:04 | 2018-01-01 | 03:04 | 74.835 | 36.744 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 2 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.865 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 3 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.866 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 220.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 4 | 1497288 | 01.01.2018 21:02 | 01.01.2018 | 21:02 | 01.01.2018 05:47 | 2018-01-01 | 05:47 | 69.744 | 16.516 | 5.0 | ... | Sei | 203.0 | Sei | 2895.0 | 4.0 | 21-27,99 m | NaN | 354.0 | 9.0 | 27.49 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239798 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Other | 304.0 | Steinbiter | 145.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239799 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Uer (vanlig) | 302.0 | Uer | 136.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239800 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Other | 304.0 | Steinbiter | 132.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239801 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Other | 302.0 | Uer | 102.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239802 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 31.12.2018 19:41 | 2018-12-31 | 19:41 | 76.906 | 12.709 | 21.0 | ... | Blåkveite | 301.0 | Blåkveite | 63.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
239803 rows × 45 columns
type(updated_group_data_train)
pandas.core.frame.DataFrame
updated_group_data_train["Hovedart FAO"]
0 Other
1 Other
2 Other
3 Other
4 Sei
...
239798 Hyse
239799 Hyse
239800 Hyse
239801 Hyse
239802 Hyse
Name: Hovedart FAO, Length: 239803, dtype: object
Now printing it out to see if it did an effect, and we do see some instances of the label: "other"!
updated_group_data_train["Art - FDIR"].head(100).to_string()
'0 Other\n1 Other\n2 Other\n3 Other\n4 Sei\n5 Torsk\n6 Uer (vanlig)\n7 Lange\n8 Hyse\n9 Other\n10 Torsk\n11 Hyse\n12 Other\n13 Other\n14 Uer (vanlig)\n15 Lange\n16 Other\n17 Sei\n18 Torsk\n19 Lange\n20 Other\n21 Sei\n22 Torsk\n23 Sei\n24 Other\n25 Other\n26 Breiflabb\n27 Torsk\n28 Torsk\n29 Hyse\n30 Other\n31 Torsk\n32 Sei\n33 Torsk\n34 Lysing\n35 Sei\n36 Lysing\n37 Sei\n38 Lange\n39 Hyse\n40 Torsk\n41 Other\n42 Other\n43 Uer (vanlig)\n44 Brosme\n45 Torsk\n46 Torsk\n47 Torsk\n48 Torsk\n49 Torsk\n50 Uer (vanlig)\n51 Other\n52 Torsk\n53 Hyse\n54 Other\n55 Other\n56 Uer (vanlig)\n57 Sei\n58 Sei\n59 Sei\n60 Torsk\n61 Hyse\n62 Sei\n63 Torsk\n64 Torsk\n65 Hyse\n66 Uer (vanlig)\n67 Sei\n68 Torsk\n69 Hyse\n70 Uer (vanlig)\n71 Sei\n72 Torsk\n73 Hyse\n74 Uer (vanlig)\n75 Sei\n76 Torsk\n77 Hyse\n78 Other\n79 Sei\n80 Torsk\n81 Hyse\n82 Uer (vanlig)\n83 Other\n84 Sei\n85 Torsk\n86 Hyse\n87 Uer (vanlig)\n88 Torsk\n89 Hyse\n90 Sei\n91 Sei\n92 Torsk\n93 Other\n94 Other\n95 Blåkveite\n96 Torsk\n97 Sei\n98 Hyse\n99 Torsk'
updated_group_data_train["Hovedart FAO"].head(100).to_string()
'0 Other\n1 Other\n2 Other\n3 Other\n4 Sei\n5 Sei\n6 Sei\n7 Sei\n8 Sei\n9 Sei\n10 Torsk\n11 Torsk\n12 Torsk\n13 Torsk\n14 Torsk\n15 Lange\n16 Lange\n17 Lange\n18 Lange\n19 Lange\n20 Lange\n21 Lange\n22 Lange\n23 Sei\n24 Sei\n25 Sei\n26 Sei\n27 Sei\n28 Torsk\n29 Torsk\n30 Torsk\n31 Torsk\n32 Sei\n33 Sei\n34 Lysing\n35 Lysing\n36 Lysing\n37 Lysing\n38 Lysing\n39 Hyse\n40 Hyse\n41 Hyse\n42 Hyse\n43 Hyse\n44 Hyse\n45 Torsk\n46 Torsk\n47 Torsk\n48 Torsk\n49 Torsk\n50 Torsk\n51 Torsk\n52 Torsk\n53 Torsk\n54 Torsk\n55 Torsk\n56 Torsk\n57 Sei\n58 Sei\n59 Sei\n60 Torsk\n61 Torsk\n62 Sei\n63 Sei\n64 Torsk\n65 Torsk\n66 Torsk\n67 Torsk\n68 Torsk\n69 Torsk\n70 Torsk\n71 Torsk\n72 Torsk\n73 Torsk\n74 Torsk\n75 Torsk\n76 Torsk\n77 Torsk\n78 Torsk\n79 Sei\n80 Sei\n81 Sei\n82 Sei\n83 Sei\n84 Sei\n85 Sei\n86 Sei\n87 Sei\n88 Torsk\n89 Torsk\n90 Sei\n91 Sei\n92 Torsk\n93 Torsk\n94 Torsk\n95 Torsk\n96 Torsk\n97 Torsk\n98 Torsk\n99 Torsk'
updated_group_data_train["Redskap FDIR"].head(150).to_string()
'0 Teiner\n1 Teiner\n2 Teiner\n3 Teiner\n4 Udefinert garn\n5 Udefinert garn\n6 Udefinert garn\n7 Udefinert garn\n8 Udefinert garn\n9 Udefinert garn\n10 Andre liner\n11 Andre liner\n12 Andre liner\n13 Andre liner\n14 Andre liner\n15 Dobbeltrål\n16 Dobbeltrål\n17 Dobbeltrål\n18 Dobbeltrål\n19 Dobbeltrål\n20 Dobbeltrål\n21 Dobbeltrål\n22 Dobbeltrål\n23 Dobbeltrål\n24 Dobbeltrål\n25 Dobbeltrål\n26 Dobbeltrål\n27 Dobbeltrål\n28 Bunntrål\n29 Bunntrål\n30 Bunntrål\n31 Andre liner\n32 Snurrevad\n33 Snurrevad\n34 Udefinert trål\n35 Udefinert trål\n36 Udefinert trål\n37 Udefinert trål\n38 Udefinert trål\n39 Andre liner\n40 Andre liner\n41 Andre liner\n42 Andre liner\n43 Andre liner\n44 Andre liner\n45 Bunntrål\n46 Bunntrål\n47 Bunntrål\n48 Bunntrål\n49 Bunntrål\n50 Bunntrål\n51 Bunntrål\n52 Andre liner\n53 Andre liner\n54 Andre liner\n55 Andre liner\n56 Andre liner\n57 Bunntrål\n58 Bunntrål\n59 Bunntrål\n60 Andre liner\n61 Andre liner\n62 Snurrevad\n63 Snurrevad\n64 Bunntrål\n65 Bunntrål\n66 Bunntrål\n67 Bunntrål\n68 Bunntrål\n69 Bunntrål\n70 Bunntrål\n71 Bunntrål\n72 Bunntrål\n73 Bunntrål\n74 Bunntrål\n75 Bunntrål\n76 Andre liner\n77 Andre liner\n78 Andre liner\n79 Bunntrål\n80 Bunntrål\n81 Bunntrål\n82 Bunntrål\n83 Bunntrål\n84 Bunntrål\n85 Bunntrål\n86 Bunntrål\n87 Bunntrål\n88 Bunntrål\n89 Bunntrål\n90 Snurrevad\n91 Snurrevad\n92 Andre liner\n93 Andre liner\n94 Andre liner\n95 Andre liner\n96 Bunntrål\n97 Bunntrål\n98 Bunntrål\n99 Bunntrål\n100 Bunntrål\n101 Bunntrål\n102 Andre liner\n103 Andre liner\n104 Andre liner\n105 Andre liner\n106 Andre liner\n107 Andre liner\n108 Andre liner\n109 Bunntrål par\n110 Bunntrål par\n111 Bunntrål par\n112 Bunntrål par\n113 Bunntrål par\n114 Bunntrål par\n115 Bunntrål par\n116 Bunntrål par\n117 Bunntrål par\n118 Bunntrål par\n119 Bunntrål par\n120 Snurrevad\n121 Snurrevad\n122 Snurrevad\n123 Snurrevad\n124 Bunntrål\n125 Bunntrål\n126 Bunntrål\n127 Bunntrål\n128 Bunntrål\n129 Bunntrål\n130 Bunntrål\n131 Bunntrål\n132 Bunntrål\n133 Bunntrål\n134 Bunntrål\n135 Bunntrål\n136 Bunntrål\n137 Bunntrål\n138 Bunntrål\n139 Bunntrål\n140 Bunntrål\n141 Bunntrål\n142 Bunntrål\n143 Bunntrål\n144 Bunntrål\n145 Bunntrål\n146 Bunntrål\n147 Bunntrål\n148 Bunntrål\n149 Bunntrål'
# Grouping onnce again
Grouped_data_train = updated_group_data_train.groupby("Melding ID")
Grouped_data_test = updated_group_data_test.groupby("Melding ID")
Grouped_data_train.head()
| Melding ID | Meldingstidspunkt | Meldingsdato | Meldingsklokkeslett | Starttidspunkt | Startdato | Startklokkeslett | Startposisjon bredde | Startposisjon lengde | Hovedområde start (kode) | ... | Art - FDIR | Art - gruppe (kode) | Art - gruppe | Rundvekt | Lengdegruppe (kode) | Lengdegruppe | Bruttotonnasje 1969 | Bruttotonnasje annen | Bredde | Fartøylengde | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 01:19 | 2018-01-01 | 01:19 | 74.811 | 36.665 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 1 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 03:04 | 2018-01-01 | 03:04 | 74.835 | 36.744 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 2 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.865 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 217.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 3 | 1497249 | 01.01.2018 17:36 | 01.01.2018 | 17:36 | 01.01.2018 11:57 | 2018-01-01 | 11:57 | 74.828 | 36.866 | 15.0 | ... | Other | 501.0 | Snøkrabbe | 220.0 | 5.0 | 28 m og over | NaN | 1101.0 | 11.2 | 49.95 |
| 4 | 1497288 | 01.01.2018 21:02 | 01.01.2018 | 21:02 | 01.01.2018 05:47 | 2018-01-01 | 05:47 | 69.744 | 16.516 | 5.0 | ... | Sei | 203.0 | Sei | 2895.0 | 4.0 | 21-27,99 m | NaN | 354.0 | 9.0 | 27.49 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239776 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 30.12.2018 23:20 | 2018-12-30 | 23:20 | 76.509 | 14.295 | 21.0 | ... | Hyse | 202.0 | Hyse | 7277.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239777 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 30.12.2018 23:20 | 2018-12-30 | 23:20 | 76.509 | 14.295 | 21.0 | ... | Torsk | 201.0 | Torsk | 3126.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239778 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 30.12.2018 23:20 | 2018-12-30 | 23:20 | 76.509 | 14.295 | 21.0 | ... | Blåkveite | 301.0 | Blåkveite | 315.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239779 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 30.12.2018 23:20 | 2018-12-30 | 23:20 | 76.509 | 14.295 | 21.0 | ... | Other | 304.0 | Steinbiter | 145.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
| 239780 | 1800291 | 01.01.2019 09:28 | 01.01.2019 | 09:28 | 30.12.2018 23:20 | 2018-12-30 | 23:20 | 76.509 | 14.295 | 21.0 | ... | Other | 304.0 | Steinbiter | 132.0 | 5.0 | 28 m og over | 1483.0 | NaN | 12.6 | 57.30 |
145027 rows × 45 columns
Keeping a dataframe of the most import information for now, some info for visualization and some for futher exploration
species_order = ['Torsk', 'Sei', 'Hyse', 'Lange', 'Uer(vanlig)', 'Dypvannsreke', 'Other'] #Defined afte some of the most common (see previous sections)
def process_grouped_data(grouped_data_gen, species_order):
result_rows = []
for name, group in grouped_data_gen:
common_info = {
'Melding ID': name,
'latitude': group['Startposisjon bredde'].iloc[0],
'longitude': group['Startposisjon lengde'].iloc[0],
'main_species': group['Hovedart FAO'].iloc[0], # only for visualization
'vessel_ratio(height/width)': group['Fartøylengde'].iloc[0] / group['Bredde'].iloc[0],
'start_date': group['Startdato'].iloc[0],
'time_duration': group['Varighet'].iloc[0],
'total_weight': group['Rundvekt'].sum(), # only for visualization
'times': group['Startklokkeslett'].iloc[0],
'tools_used': group['Redskap FDIR'].iloc[0],
'species_weights_list': [group.loc[group['Art - FDIR'] == species, 'Rundvekt'].sum() for species in species_order] # target feature
}
result_rows.append(common_info)
result_df = pd.DataFrame(result_rows)
return result_df
# for both the training and test set here:
result_df_train = process_grouped_data(Grouped_data_train, species_order)
result_df_test = process_grouped_data(Grouped_data_test, species_order)
result_df_train
| Melding ID | latitude | longitude | main_species | vessel_ratio(height/width) | start_date | time_duration | total_weight | times | tools_used | species_weights_list | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | Other | 4.459821 | 2018-01-01 | 101 | 871.0 | 01:19 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0] |
| 1 | 1497288 | 69.744 | 16.516 | Sei | 3.054444 | 2018-01-01 | 881 | 5304.0 | 05:47 | Udefinert garn | [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0] |
| 2 | 1497306 | 72.866 | 29.105 | Torsk | 4.658000 | 2018-01-01 | 900 | 11321.0 | 07:00 | Andre liner | [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0] |
| 3 | 1497310 | 58.636 | 0.876 | Lange | 3.467143 | 2018-01-01 | 249 | 2994.0 | 07:09 | Dobbeltrål | [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0] |
| 4 | 1497311 | 73.127 | 28.324 | Torsk | 4.014286 | 2018-01-01 | 87 | 4131.0 | 17:09 | Bunntrål | [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | Torsk | 4.148438 | 2018-12-31 | 1138 | 28453.0 | 04:00 | Andre liner | [22110.0, 0.0, 6343.0, 0.0, 0.0, 0.0, 0.0] |
| 36796 | 1800269 | 70.844 | 50.071 | Hyse | 4.271429 | 2018-12-31 | 1226 | 25363.0 | 02:34 | Andre liner | [10107.0, 0.0, 15201.0, 0.0, 0.0, 0.0, 55.0] |
| 36797 | 1800285 | 74.892 | 17.255 | Torsk | 4.410256 | 2018-12-31 | 317 | 29247.0 | 00:26 | Bunntrål | [20316.0, 0.0, 7303.0, 0.0, 0.0, 0.0, 667.0] |
| 36798 | 1800286 | 70.888 | 22.321 | Sei | 3.789524 | 2018-12-31 | 152 | 20262.0 | 09:50 | Bunntrål | [4117.0, 15749.0, 258.0, 0.0, 0.0, 0.0, 138.0] |
| 36799 | 1800291 | 76.509 | 14.295 | Hyse | 4.547619 | 2018-12-30 | 301 | 45742.0 | 23:20 | Bunntrål | [16725.0, 32.0, 27144.0, 0.0, 0.0, 0.0, 998.0] |
36800 rows × 11 columns
One thing to mention is our species_order (above), this order will be maintained troughout our project and we will later explore what we will do with this data. Also see that some of our times, just have a singular item in them, we will keep this in mind for later. (See Encoding later)
And another thing is our time variable, we are taking the first one out of the group, this represents their first time of actually starting the expedition (the first catch).
type(result_df_train)
pandas.core.frame.DataFrame
result_df_train["species_weights_list"]
0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0]
1 [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0]
2 [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0]
3 [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0]
4 [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0]
...
36795 [22110.0, 0.0, 6343.0, 0.0, 0.0, 0.0, 0.0]
36796 [10107.0, 0.0, 15201.0, 0.0, 0.0, 0.0, 55.0]
36797 [20316.0, 0.0, 7303.0, 0.0, 0.0, 0.0, 667.0]
36798 [4117.0, 15749.0, 258.0, 0.0, 0.0, 0.0, 138.0]
36799 [16725.0, 32.0, 27144.0, 0.0, 0.0, 0.0, 998.0]
Name: species_weights_list, Length: 36800, dtype: object
This group data will be mainly used for representing and visualizing, so beware of its name, since it only has one element per trip, which we want to use!
group_data = result_df_train.groupby('Melding ID')
group_data.groups
{1497249: [0], 1497288: [1], 1497306: [2], 1497310: [3], 1497311: [4], 1497312: [5], 1497321: [6], 1497323: [7], 1497326: [8], 1497330: [9], 1497332: [10], 1497341: [11], 1497344: [12], 1497350: [13], 1497352: [14], 1497353: [15], 1497354: [16], 1497358: [17], 1497362: [18], 1497364: [19], 1497368: [20], 1497384: [21], 1497414: [22], 1497421: [23], 1497432: [24], 1497433: [25], 1497435: [26], 1497444: [27], 1497448: [28], 1497475: [29], 1497482: [30], 1497484: [31], 1497495: [32], 1497505: [33], 1497537: [34], 1497550: [35], 1497555: [36], 1497556: [37], 1497559: [38], 1497589: [39], 1497600: [40], 1497621: [41], 1497628: [42], 1497631: [43], 1497659: [44], 1497681: [45], 1497686: [46], 1497692: [47], 1497719: [48], 1497720: [49], 1497758: [50], 1497776: [51], 1497784: [52], 1497789: [53], 1497801: [54], 1497803: [55], 1497805: [56], 1497812: [57], 1497816: [58], 1497818: [59], 1497823: [60], 1497824: [61], 1497827: [62], 1497833: [63], 1497836: [64], 1497838: [65], 1497839: [66], 1497843: [67], 1497846: [68], 1497848: [69], 1497849: [70], 1497852: [71], 1497856: [72], 1497860: [73], 1497863: [74], 1497869: [75], 1497873: [76], 1497875: [77], 1497904: [78], 1497907: [79], 1497918: [80], 1497923: [81], 1497924: [82], 1497925: [83], 1497938: [84], 1497941: [85], 1497943: [86], 1497944: [87], 1497962: [88], 1497965: [89], 1497968: [90], 1497974: [91], 1497976: [92], 1497983: [93], 1497991: [94], 1497994: [95], 1498002: [96], 1498041: [97], 1498045: [98], 1498050: [99], ...}
group_data.head()
| Melding ID | latitude | longitude | main_species | vessel_ratio(height/width) | start_date | time_duration | total_weight | times | tools_used | species_weights_list | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | Other | 4.459821 | 2018-01-01 | 101 | 871.0 | 01:19 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0] |
| 1 | 1497288 | 69.744 | 16.516 | Sei | 3.054444 | 2018-01-01 | 881 | 5304.0 | 05:47 | Udefinert garn | [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0] |
| 2 | 1497306 | 72.866 | 29.105 | Torsk | 4.658000 | 2018-01-01 | 900 | 11321.0 | 07:00 | Andre liner | [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0] |
| 3 | 1497310 | 58.636 | 0.876 | Lange | 3.467143 | 2018-01-01 | 249 | 2994.0 | 07:09 | Dobbeltrål | [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0] |
| 4 | 1497311 | 73.127 | 28.324 | Torsk | 4.014286 | 2018-01-01 | 87 | 4131.0 | 17:09 | Bunntrål | [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | Torsk | 4.148438 | 2018-12-31 | 1138 | 28453.0 | 04:00 | Andre liner | [22110.0, 0.0, 6343.0, 0.0, 0.0, 0.0, 0.0] |
| 36796 | 1800269 | 70.844 | 50.071 | Hyse | 4.271429 | 2018-12-31 | 1226 | 25363.0 | 02:34 | Andre liner | [10107.0, 0.0, 15201.0, 0.0, 0.0, 0.0, 55.0] |
| 36797 | 1800285 | 74.892 | 17.255 | Torsk | 4.410256 | 2018-12-31 | 317 | 29247.0 | 00:26 | Bunntrål | [20316.0, 0.0, 7303.0, 0.0, 0.0, 0.0, 667.0] |
| 36798 | 1800286 | 70.888 | 22.321 | Sei | 3.789524 | 2018-12-31 | 152 | 20262.0 | 09:50 | Bunntrål | [4117.0, 15749.0, 258.0, 0.0, 0.0, 0.0, 138.0] |
| 36799 | 1800291 | 76.509 | 14.295 | Hyse | 4.547619 | 2018-12-30 | 301 | 45742.0 | 23:20 | Bunntrål | [16725.0, 32.0, 27144.0, 0.0, 0.0, 0.0, 998.0] |
36800 rows × 11 columns
Much better, now actually showing each group with the same "Melding ID" as one group, containing multiple catches and their corresponding species and weights, and so on...
We want the whole dataset, as it is now, to use for our unsupervised learning algorithm, which we can combine here:
combined_df = pd.concat([result_df_train, result_df_test], ignore_index=True)
It would be beneficial to show some of the information, like the Vessel Ratio comparied to the total weight for each group, to be represented.
total_catch_by_group = group_data["total_weight"].sum()
vessel_size_by_group = group_data["vessel_ratio(height/width)"].first()
scatter_data = pd.DataFrame({ # only for representation, for now.
'Melding ID': total_catch_by_group.index,
'Total Weight': total_catch_by_group.values,
'Vessel Ratio': vessel_size_by_group.values
})
plt.figure(figsize=(20, 15))
sns.scatterplot(x='Vessel Ratio', y='Total Weight', hue='Total Weight', data=scatter_data, palette='viridis')
plt.title('Scatter Plot of Vessel Ratio vs. Total Weight for Group')
plt.xlabel('Vessel Ratio')
plt.ylabel('Total Weight')
plt.legend(title='Weight')
plt.show()
This isnt really giving us the best representation out there, but it is a good start! Since most of these groups will be a lot smaller than some of the really bigger ones, its hard to represent them all in one plot together, so this will be our starting point.
scatter_data.head()
| Melding ID | Total Weight | Vessel Ratio | |
|---|---|---|---|
| 0 | 1497249 | 871.0 | 4.459821 |
| 1 | 1497288 | 5304.0 | 3.054444 |
| 2 | 1497306 | 11321.0 | 4.658000 |
| 3 | 1497310 | 2994.0 | 3.467143 |
| 4 | 1497311 | 4131.0 | 4.014286 |
We want a simple heatmap distribution, we are going to do this with Plotly's map, this will give us a map that will be detailed with some hover information and give us good insight, where most of our (training) data lies, this is all from: https://plotly.com/python/maps/
fig = px.density_mapbox(
result_df_train,
lat='latitude',
lon='longitude',
z='total_weight',
hover_data=['main_species','Melding ID'],
radius=10,
zoom=3,
height=300
)
fig.update_layout(
mapbox_style="open-street-map",
margin={"r": 0, "t": 0, "l": 0, "b": 0}
)
fig.show()
It would also be hepful to get it to a scatter map, like a bubble map and show our data here: (again from Plotly, see previous section for more info)
sampled_data = result_df_train.sample(frac=0.05)
fig = px.scatter_geo(
sampled_data,
lat='latitude',
lon='longitude',
color="main_species",
hover_name="Melding ID",
size="total_weight",
projection="natural earth",
scope="europe"
)
fig.show()
I recommend troughout this chapter to read upon FeatureEngines documentation about cyclical encoding, as i will follow it closely, Read more: https://feature-engine.trainindata.com/en/latest/user_guide/creation/CyclicalFeatures.html#
We are going to cyclical encode our time, both the months, date and hours, we will not take year (or years) in consideration, as all in this project is happening in 2018.
cyclical = CyclicalFeatures(variables = None)
def transform_from_hour_to_cyclical(df):
test_dataframe = df.to_frame() # to dataframe
values_test = test_dataframe["times"].values # its values
reshape_test = values_test.reshape(-1,1) # reshape to 1D format
hours_list = np.char.split(reshape_test.astype(str), ":").tolist() # Split to ["Hour", "Minute"]
hours_list = [value[0][:1] for value in hours_list] # keep only ["Hour"]
df_hours = pd.DataFrame(hours_list, columns=['Hour']) # create new dataframe with "Hour"-col.
df_hours['Hour'] = pd.to_numeric(df_hours['Hour']) # ensure its numeric.
return df_hours
new_hour_df_test = transform_from_hour_to_cyclical(result_df_train["times"])
test = cyclical.fit_transform(new_hour_df_test[["Hour"]]) #selecting dataframe!
test.head()
| Hour | Hour_sin | Hour_cos | |
|---|---|---|---|
| 0 | 1 | 0.269797 | 0.962917 |
| 1 | 5 | 0.979084 | 0.203456 |
| 2 | 7 | 0.942261 | -0.334880 |
| 3 | 7 | 0.942261 | -0.334880 |
| 4 | 17 | -0.997669 | -0.068242 |
We can see the correlation by their website (feature-engine) and visualize it as a circle. From the following page: https://feature-engine.trainindata.com/en/latest/user_guide/creation/CyclicalFeatures.html#
fig, ax = plt.subplots(figsize=(7, 5))
sp = ax.scatter(test["Hour_sin"], test["Hour_cos"], c=test["Hour"])
ax.set(
xlabel="sin(hour)",
ylabel="cos(hour)",
)
_ = fig.colorbar(sp)
Visualizing the (x,y) cicle coordinates generated by the sine and cosine features.
Now we can actually, place these values in our dataframe, but this will be two features now, instead of one:
cyclical = CyclicalFeatures(variables = None, drop_original=True) # we can drop the orig. featues, no use for us now.
new_hour_df_train = transform_from_hour_to_cyclical(result_df_train["times"])
new_hour_df_test = transform_from_hour_to_cyclical(result_df_test["times"])
hour_train = cyclical.fit_transform(new_hour_df_train[["Hour"]]) #selecting dataframe!
hour_test = cyclical.fit_transform(new_hour_df_test[["Hour"]]) #selecting dataframe!
hour_train.head()
| Hour_sin | Hour_cos | |
|---|---|---|
| 0 | 0.269797 | 0.962917 |
| 1 | 0.979084 | 0.203456 |
| 2 | 0.942261 | -0.334880 |
| 3 | 0.942261 | -0.334880 |
| 4 | -0.997669 | -0.068242 |
result_df_train.reset_index(drop=True, inplace=True) # avoiding potential issues when concatination is happening...
hour_train.reset_index(drop=True, inplace=True)
result_df_train = pd.concat([result_df_train, hour_train[['Hour_sin', 'Hour_cos']]], axis=1) # adding to original dataframe
result_df_test.reset_index(drop=True, inplace=True) # avoiding potential issues when concatination is happening...
hour_test.reset_index(drop=True, inplace=True)
result_df_test = pd.concat([result_df_test, hour_test[['Hour_sin', 'Hour_cos']]], axis=1) # adding to original dataframe
result_df_train.drop("times", axis=1, inplace=True) # we can remove start_time as we now have a encoded version.
result_df_test.drop("times", axis=1, inplace=True) # we can remove start_time as we now have a encoded version.
result_df_train
| Melding ID | latitude | longitude | main_species | vessel_ratio(height/width) | start_date | time_duration | total_weight | tools_used | species_weights_list | Hour_sin | Hour_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | Other | 4.459821 | 2018-01-01 | 101 | 871.0 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0] | 2.697968e-01 | 0.962917 |
| 1 | 1497288 | 69.744 | 16.516 | Sei | 3.054444 | 2018-01-01 | 881 | 5304.0 | Udefinert garn | [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0] | 9.790841e-01 | 0.203456 |
| 2 | 1497306 | 72.866 | 29.105 | Torsk | 4.658000 | 2018-01-01 | 900 | 11321.0 | Andre liner | [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0] | 9.422609e-01 | -0.334880 |
| 3 | 1497310 | 58.636 | 0.876 | Lange | 3.467143 | 2018-01-01 | 249 | 2994.0 | Dobbeltrål | [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0] | 9.422609e-01 | -0.334880 |
| 4 | 1497311 | 73.127 | 28.324 | Torsk | 4.014286 | 2018-01-01 | 87 | 4131.0 | Bunntrål | [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0] | -9.976688e-01 | -0.068242 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | Torsk | 4.148438 | 2018-12-31 | 1138 | 28453.0 | Andre liner | [22110.0, 0.0, 6343.0, 0.0, 0.0, 0.0, 0.0] | 8.878852e-01 | 0.460065 |
| 36796 | 1800269 | 70.844 | 50.071 | Hyse | 4.271429 | 2018-12-31 | 1226 | 25363.0 | Andre liner | [10107.0, 0.0, 15201.0, 0.0, 0.0, 0.0, 55.0] | 5.195840e-01 | 0.854419 |
| 36797 | 1800285 | 74.892 | 17.255 | Torsk | 4.410256 | 2018-12-31 | 317 | 29247.0 | Bunntrål | [20316.0, 0.0, 7303.0, 0.0, 0.0, 0.0, 667.0] | 0.000000e+00 | 1.000000 |
| 36798 | 1800286 | 70.888 | 22.321 | Sei | 3.789524 | 2018-12-31 | 152 | 20262.0 | Bunntrål | [4117.0, 15749.0, 258.0, 0.0, 0.0, 0.0, 138.0] | 6.310879e-01 | -0.775711 |
| 36799 | 1800291 | 76.509 | 14.295 | Hyse | 4.547619 | 2018-12-30 | 301 | 45742.0 | Bunntrål | [16725.0, 32.0, 27144.0, 0.0, 0.0, 0.0, 998.0] | -2.449294e-16 | 1.000000 |
36800 rows × 12 columns
Now we have encoded our hour (time) by cyclical encoding, we want to do the same for our dates.
result_df_train[["start_date"]]
| start_date | |
|---|---|
| 0 | 2018-01-01 |
| 1 | 2018-01-01 |
| 2 | 2018-01-01 |
| 3 | 2018-01-01 |
| 4 | 2018-01-01 |
| ... | ... |
| 36795 | 2018-12-31 |
| 36796 | 2018-12-31 |
| 36797 | 2018-12-31 |
| 36798 | 2018-12-31 |
| 36799 | 2018-12-30 |
36800 rows × 1 columns
Beware they are dateFrame objects, so we have to handle them differently than previously with time, this because of our splitting method, check out Splitting our data section above.
def transform_from_date_to_cyclical(df):
# Converting the dataframe to a dataframe if it isnt already.
test_dataframe = df.to_frame() if not isinstance(df, pd.DataFrame) else df
values_test = test_dataframe["start_date"].values
date_list = pd.to_datetime(values_test).strftime('%Y-%m-%d').str.split("-").tolist() # splitting to date, monnth, year
date_list = [[value[1], value[2]] for value in date_list]# GEtting both the month and date, not year
df_dates = pd.DataFrame(date_list, columns=['Month', 'Day'])
df_dates['Day'] = pd.to_numeric(df_dates['Day']) # to numeric values
df_dates['Month'] = pd.to_numeric(df_dates['Month'])
return df_dates
new_date_df_train = transform_from_date_to_cyclical(result_df_train["start_date"])
new_date_df_test = transform_from_date_to_cyclical(result_df_test["start_date"])
new_date_df_train
| Month | Day | |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| ... | ... | ... |
| 36795 | 12 | 31 |
| 36796 | 12 | 31 |
| 36797 | 12 | 31 |
| 36798 | 12 | 31 |
| 36799 | 12 | 30 |
36800 rows × 2 columns
date_cyclical_train = cyclical.fit_transform(new_date_df_train[["Day", "Month"]]) #selecting dataframe
date_cyclical_test = cyclical.fit_transform(new_date_df_test[["Day", "Month"]])
date_cyclical_train.head()
| Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|
| 0 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 1 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 2 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 3 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 4 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
result_df_train.reset_index(drop=True, inplace=True) # avoiding potential issues when concatination is happening...
date_cyclical_train.reset_index(drop=True, inplace=True)
result_df_train = pd.concat([result_df_train, date_cyclical_train[['Day_sin', 'Day_cos', 'Month_sin', 'Month_cos']]], axis=1) # adding to original dataframe
result_df_test.reset_index(drop=True, inplace=True) # avoiding potential issues when concatination is happening...
date_cyclical_test.reset_index(drop=True, inplace=True)
result_df_test = pd.concat([result_df_test, date_cyclical_test[['Day_sin', 'Day_cos', 'Month_sin', 'Month_cos']]], axis=1) # adding to original datafram
When concating here we want to make sure that no extra values are being added and then just "wiped" off, we make sure of this by doing the following: ::: #TO DO
result_df_train.drop("start_date", axis=1, inplace=True) # we can remove start_date as we now have a encoded version.
result_df_test.drop("start_date", axis=1, inplace=True) # we can remove start_date as we now have a encoded version.
We are going to remove the total_weight, since it wont be used to predict anything in our model. As well as our main_species, these were both used for visual information before.
result_df_train.drop("total_weight", axis=1, inplace=True) # we can remove total_weight
result_df_test.drop("total_weight", axis=1, inplace=True) # we can remove total_weight
result_df_train.drop("main_species", axis=1, inplace=True) # we can remove main_species
result_df_test.drop("main_species", axis=1, inplace=True) # we can remove main_species
result_df_train
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | tools_used | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | 101 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 881 | Udefinert garn | [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0] | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 900 | Andre liner | [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0] | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | 249 | Dobbeltrål | [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0] | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | 87 | Bunntrål | [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0] | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | 4.148438 | 1138 | Andre liner | [22110.0, 0.0, 6343.0, 0.0, 0.0, 0.0, 0.0] | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36796 | 1800269 | 70.844 | 50.071 | 4.271429 | 1226 | Andre liner | [10107.0, 0.0, 15201.0, 0.0, 0.0, 0.0, 55.0] | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36797 | 1800285 | 74.892 | 17.255 | 4.410256 | 317 | Bunntrål | [20316.0, 0.0, 7303.0, 0.0, 0.0, 0.0, 667.0] | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36798 | 1800286 | 70.888 | 22.321 | 3.789524 | 152 | Bunntrål | [4117.0, 15749.0, 258.0, 0.0, 0.0, 0.0, 138.0] | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36799 | 1800291 | 76.509 | 14.295 | 4.547619 | 301 | Bunntrål | [16725.0, 32.0, 27144.0, 0.0, 0.0, 0.0, 998.0] | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
36800 rows × 13 columns
type(result_df_train['species_weights_list'])
pandas.core.series.Series
Before we apply it to our sum_based_normalization function we want to scale it, we are going to use Min-Max scaler, EXPLAIN WHY.
weights_series_train = result_df_train['species_weights_list']
weights_df_train = pd.DataFrame(weights_series_train.tolist(), index=result_df_train.index)
# we are creating a dataframe with our scaled values, for representing how our scaled values are.
scaled_weights_df_train = pd.DataFrame(minmax_scaler.fit_transform(weights_df_train), columns=weights_df_train.columns, index=weights_df_train.index)
weights_series_test = result_df_test['species_weights_list']
weights_df_test = pd.DataFrame(weights_series_test.tolist(), index=result_df_test.index)
# we are creating a dataframe with our scaled values, for representing how our scaled values are.
scaled_weights_df_test = pd.DataFrame(minmax_scaler.fit_transform(weights_df_test), columns=weights_df_test.columns, index=weights_df_test.index)
scaled_weights_df_train
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000238 |
| 1 | 0.004719 | 0.011092 | 0.000475 | 0.004934 | 0.0 | 0.0 | 0.000004 |
| 2 | 0.018811 | 0.000000 | 0.019853 | 0.000000 | 0.0 | 0.0 | 0.000180 |
| 3 | 0.000422 | 0.001839 | 0.000000 | 0.072289 | 0.0 | 0.0 | 0.000239 |
| 4 | 0.008652 | 0.000000 | 0.001777 | 0.000000 | 0.0 | 0.0 | 0.000022 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 0.049686 | 0.000000 | 0.055793 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 36796 | 0.022713 | 0.000000 | 0.133708 | 0.000000 | 0.0 | 0.0 | 0.000015 |
| 36797 | 0.045655 | 0.000000 | 0.064237 | 0.000000 | 0.0 | 0.0 | 0.000182 |
| 36798 | 0.009252 | 0.060341 | 0.002269 | 0.000000 | 0.0 | 0.0 | 0.000038 |
| 36799 | 0.037585 | 0.000123 | 0.238759 | 0.000000 | 0.0 | 0.0 | 0.000273 |
36800 rows × 7 columns
# simple sum_based_normalization tecnique, if its 0 we want to make sure its still 0, rest is quite self-explanatory.
def sum_based_normalization(x):
total_sum = sum(x)
normed = [float(i) / total_sum if total_sum != 0 else 0 for i in x]
return normed
# we used softmax before
normalization_weights_df_train = scaled_weights_df_train.apply(sum_based_normalization, axis=1)
result_df_train['species_weights_list'] = normalization_weights_df_train.values.tolist()
normalization_weights_df_test = scaled_weights_df_test.apply(sum_based_normalization, axis=1)
result_df_test['species_weights_list'] = normalization_weights_df_test.values.tolist()
result_df_train
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | tools_used | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | 101 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 881 | Udefinert garn | [0.2223503753251861, 0.5226136974019973, 0.022... | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 900 | Andre liner | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | 249 | Dobbeltrål | [0.005648888357181467, 0.024590084867846508, 0... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | 87 | Bunntrål | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | 4.148438 | 1138 | Andre liner | [0.47105121379633835, 0.0, 0.5289487862036616,... | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36796 | 1800269 | 70.844 | 50.071 | 4.271429 | 1226 | Andre liner | [0.1451884738353054, 0.0, 0.8547154654897384, ... | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36797 | 1800285 | 74.892 | 17.255 | 4.410256 | 317 | Bunntrål | [0.41476232791645434, 0.0, 0.5835820546036395,... | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36798 | 1800286 | 70.888 | 22.321 | 3.789524 | 152 | Bunntrål | [0.12867629450715945, 0.8392363958041694, 0.03... | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36799 | 1800291 | 76.509 | 14.295 | 4.547619 | 301 | Bunntrål | [0.13581319586500068, 0.00044303646043079186, ... | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
36800 rows × 13 columns
result_df_train["species_weights_list"].iloc[0] # quite interesting, meaning that all values are of "other" category!
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
result_df_train.head()
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | tools_used | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | 101 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 0.269797 | 0.962917 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 881 | Udefinert garn | [0.2223503753251861, 0.5226136974019973, 0.022... | 0.979084 | 0.203456 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 900 | Andre liner | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 0.942261 | -0.334880 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | 249 | Dobbeltrål | [0.005648888357181467, 0.024590084867846508, 0... | 0.942261 | -0.334880 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | 87 | Bunntrål | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -0.997669 | -0.068242 | 0.201299 | 0.97953 | 0.5 | 0.866025 |
result_df_train["species_weights_list"].iloc[2]
[0.4842780764348872, 0.0, 0.5110796099926982, 0.0, 0.0, 0.0, 0.004642313572414772]
Using min-max scaling on our time_duration, because we want to not care to much about outliers and we want to perserve the original distribution, we also want them to stay between 0 - 1, for simplicity and since some of our other data does aswell.
result_df_train['time_duration'] = standard_scaler.fit_transform(result_df_train[['time_duration']])
result_df_test['time_duration'] = standard_scaler.fit_transform(result_df_test[['time_duration']])
result_df_train
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | tools_used | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | -0.208002 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 0.096073 | Udefinert garn | [0.2223503753251861, 0.5226136974019973, 0.022... | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 0.103480 | Andre liner | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | -0.150306 | Dobbeltrål | [0.005648888357181467, 0.024590084867846508, 0... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | -0.213460 | Bunntrål | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | 4.148438 | 0.196262 | Andre liner | [0.47105121379633835, 0.0, 0.5289487862036616,... | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36796 | 1800269 | 70.844 | 50.071 | 4.271429 | 0.230568 | Andre liner | [0.1451884738353054, 0.0, 0.8547154654897384, ... | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36797 | 1800285 | 74.892 | 17.255 | 4.410256 | -0.123797 | Bunntrål | [0.41476232791645434, 0.0, 0.5835820546036395,... | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36798 | 1800286 | 70.888 | 22.321 | 3.789524 | -0.188121 | Bunntrål | [0.12867629450715945, 0.8392363958041694, 0.03... | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 36799 | 1800291 | 76.509 | 14.295 | 4.547619 | -0.130034 | Bunntrål | [0.13581319586500068, 0.00044303646043079186, ... | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
36800 rows × 13 columns
from sklearn.preprocessing import OneHotEncoder
result_df_train = pd.concat([result_df_train, pd.get_dummies(result_df_train["tools_used"], prefix="tools").astype(int)], axis=1)
result_df_train = result_df_train.drop("tools_used", axis=1)
result_df_test = pd.concat([result_df_test, pd.get_dummies(result_df_test["tools_used"], prefix="tools").astype(int)], axis=1)
result_df_test = result_df_test.drop("tools_used", axis=1)
result_df_train
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | ... | tools_Bunntrål | tools_Bunntrål par | tools_Dobbeltrål | tools_Other | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | -0.208002 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 0.096073 | [0.2223503753251861, 0.5226136974019973, 0.022... | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 0.103480 | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | -0.150306 | [0.005648888357181467, 0.024590084867846508, 0... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | -0.213460 | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | 4.148438 | 0.196262 | [0.47105121379633835, 0.0, 0.5289487862036616,... | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36796 | 1800269 | 70.844 | 50.071 | 4.271429 | 0.230568 | [0.1451884738353054, 0.0, 0.8547154654897384, ... | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36797 | 1800285 | 74.892 | 17.255 | 4.410256 | -0.123797 | [0.41476232791645434, 0.0, 0.5835820546036395,... | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36798 | 1800286 | 70.888 | 22.321 | 3.789524 | -0.188121 | [0.12867629450715945, 0.8392363958041694, 0.03... | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36799 | 1800291 | 76.509 | 14.295 | 4.547619 | -0.130034 | [0.13581319586500068, 0.00044303646043079186, ... | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
36800 rows × 23 columns
We will just remove main_location, though it could be possible to perhaps use it, we will just use lat/long instead, we tried baseN encoding our main_location but that would imply that there is a ordinary relationship between them - which there most likely isn't.
And now the min-max scaler for vessel_ratio. Since we want to presenve it inside a fixed range, just like time_duration
result_df_train['vessel_ratio(height/width)'] = minmax_scaler.fit_transform(result_df_train[['vessel_ratio(height/width)']])
result_df_test['vessel_ratio(height/width)'] = minmax_scaler.fit_transform(result_df_test[['vessel_ratio(height/width)']])
result_df_train
| Melding ID | latitude | longitude | vessel_ratio(height/width) | time_duration | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | ... | tools_Bunntrål | tools_Bunntrål par | tools_Dobbeltrål | tools_Other | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 0.460517 | -0.208002 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1497288 | 69.744 | 16.516 | 0.174123 | 0.096073 | [0.2223503753251861, 0.5226136974019973, 0.022... | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1497306 | 72.866 | 29.105 | 0.500902 | 0.103480 | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1497310 | 58.636 | 0.876 | 0.258224 | -0.150306 | [0.005648888357181467, 0.024590084867846508, 0... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1497311 | 73.127 | 28.324 | 0.369723 | -0.213460 | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1800267 | 72.840 | 28.893 | 0.397061 | 0.196262 | [0.47105121379633835, 0.0, 0.5289487862036616,... | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36796 | 1800269 | 70.844 | 50.071 | 0.422125 | 0.230568 | [0.1451884738353054, 0.0, 0.8547154654897384, ... | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36797 | 1800285 | 74.892 | 17.255 | 0.450416 | -0.123797 | [0.41476232791645434, 0.0, 0.5835820546036395,... | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36798 | 1800286 | 70.888 | 22.321 | 0.323920 | -0.188121 | [0.12867629450715945, 0.8392363958041694, 0.03... | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36799 | 1800291 | 76.509 | 14.295 | 0.478409 | -0.130034 | [0.13581319586500068, 0.00044303646043079186, ... | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
36800 rows × 23 columns
Coordinates have different ranges (lat/long) therefor we can use StandardScaler here.
result_df_train['latitude'] = standard_scaler.fit_transform(result_df_train[['latitude']])
result_df_test['latitude'] = standard_scaler.fit_transform(result_df_test[['latitude']])
result_df_train['longitude'] = standard_scaler.fit_transform(result_df_train[['longitude']])
result_df_test['longitude'] = standard_scaler.fit_transform(result_df_test[['longitude']])
Message ID dosent give us any info that our prediction needs, therefor we remove it, and so our dataframe will look like this.
result_df_train.drop("Melding ID", axis=1, inplace=True) # we can remove Melding ID (message ID)
result_df_test.drop("Melding ID", axis=1, inplace=True) # we can remove Melding ID (message ID)
result_df_train
| latitude | longitude | vessel_ratio(height/width) | time_duration | species_weights_list | Hour_sin | Hour_cos | Day_sin | Day_cos | Month_sin | ... | tools_Bunntrål | tools_Bunntrål par | tools_Dobbeltrål | tools_Other | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.312376 | 1.625922 | 0.460517 | -0.208002 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] | 2.697968e-01 | 0.962917 | 2.012985e-01 | 0.97953 | 5.000000e-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.515580 | 0.175410 | 0.174123 | 0.096073 | [0.2223503753251861, 0.5226136974019973, 0.022... | 9.790841e-01 | 0.203456 | 2.012985e-01 | 0.97953 | 5.000000e-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.006521 | 1.081683 | 0.500902 | 0.103480 | [0.4842780764348872, 0.0, 0.5110796099926982, ... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | -1.231177 | -0.950503 | 0.258224 | -0.150306 | [0.005648888357181467, 0.024590084867846508, 0... | 9.422609e-01 | -0.334880 | 2.012985e-01 | 0.97953 | 5.000000e-01 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.047564 | 1.025459 | 0.369723 | -0.213460 | [0.8279094299794811, 0.0, 0.1700250839952644, ... | -9.976688e-01 | -0.068242 | 2.012985e-01 | 0.97953 | 5.000000e-01 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36795 | 1.002433 | 1.066421 | 0.397061 | 0.196262 | [0.47105121379633835, 0.0, 0.5289487862036616,... | 8.878852e-01 | 0.460065 | -2.449294e-16 | 1.00000 | -2.449294e-16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36796 | 0.688557 | 2.591010 | 0.422125 | 0.230568 | [0.1451884738353054, 0.0, 0.8547154654897384, ... | 5.195840e-01 | 0.854419 | -2.449294e-16 | 1.00000 | -2.449294e-16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36797 | 1.325114 | 0.228610 | 0.450416 | -0.123797 | [0.41476232791645434, 0.0, 0.5835820546036395,... | 0.000000e+00 | 1.000000 | -2.449294e-16 | 1.00000 | -2.449294e-16 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36798 | 0.695476 | 0.593307 | 0.323920 | -0.188121 | [0.12867629450715945, 0.8392363958041694, 0.03... | 6.310879e-01 | -0.775711 | -2.449294e-16 | 1.00000 | -2.449294e-16 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36799 | 1.579391 | 0.015521 | 0.478409 | -0.130034 | [0.13581319586500068, 0.00044303646043079186, ... | -2.449294e-16 | 1.000000 | -2.012985e-01 | 0.97953 | -2.449294e-16 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
36800 rows × 22 columns
We have sucessfully encoded or scaled all of our data and can now move on to the supervised learning section see below
Starting off with splitting the data correctly to training and test-sets, we want to also predict and work with a list with values when it comes to our target feature, so we need to convert it (see below)
X_train = result_df_train.drop("species_weights_list", axis=1)
y_train = result_df_train["species_weights_list"]
X_test = result_df_test.drop("species_weights_list", axis=1)
y_test = result_df_test["species_weights_list"]
# have to convert to a numpy array for it to work. (2D)
y_train_array = np.array([np.array(x) for x in y_train])
# have to convert to a numpy array for it to work. (2D)
y_test_array = np.array([np.array(x) for x in y_test])
Now that we have our data partioned as we want, let's try some algorithms on it:
Most code under is inspired by the book once again, check out https://github.com/amueller/introduction_to_ml_with_python/blob/main/02-supervised-learning.ipynb Chapter 02 - supervised learning.
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train_array)
KNeighborsRegressor(n_neighbors=10)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KNeighborsRegressor(n_neighbors=10)
print("Test set predictions:\n", reg.predict(X_test))
Test set predictions: [[6.55179125e-01 1.43950766e-01 1.52086672e-01 ... 0.00000000e+00 0.00000000e+00 2.02244749e-04] [3.35823119e-01 0.00000000e+00 6.61259476e-01 ... 0.00000000e+00 0.00000000e+00 2.91740488e-03] [5.17841663e-01 7.18455728e-02 4.09561723e-01 ... 0.00000000e+00 0.00000000e+00 7.51041884e-04] ... [3.95666875e-01 0.00000000e+00 5.99972551e-01 ... 0.00000000e+00 0.00000000e+00 4.36057464e-03] [5.01471659e-01 0.00000000e+00 4.91860391e-01 ... 0.00000000e+00 0.00000000e+00 6.66795036e-03] [3.00067606e-01 2.00276767e-03 5.95031969e-01 ... 0.00000000e+00 0.00000000e+00 2.89765737e-03]]
# this is just for seeing how close these values are, to get some idea of how good/bad it i
comparison = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': reg.predict(X_test).flatten()})
comparison
| Actual | Predicted | |
|---|---|---|
| 0 | 0.995320 | 0.655179 |
| 1 | 0.000000 | 0.143951 |
| 2 | 0.002583 | 0.152087 |
| 3 | 0.000000 | 0.048581 |
| 4 | 0.000000 | 0.000000 |
| ... | ... | ... |
| 64731 | 0.349780 | 0.595032 |
| 64732 | 0.000000 | 0.000000 |
| 64733 | 0.000000 | 0.000000 |
| 64734 | 0.000000 | 0.000000 |
| 64735 | 0.002383 | 0.002898 |
64736 rows × 2 columns
see above we see that our predicted values actually look somewhat promising, and aren't too far off the actual values. Let's now see the R^2 score below:
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test_array)))
Test set R^2: 0.73
from sklearn.tree import DecisionTreeRegressor
tree_regressor = DecisionTreeRegressor(random_state=0)
tree_regressor.fit(X_train, y_train_array)
print("R-squared on training set: {:.3f}".format(tree_regressor.score(X_train, y_train_array)))
print("R-squared on test set: {:.3f}".format(tree_regressor.score(X_test, y_test_array)))
R-squared on training set: 1.000 R-squared on test set: 0.689
Most likely overfitting, lets try to avoid that by minimizing the tree:
tree = DecisionTreeRegressor(max_depth=11, random_state=0)
tree.fit(X_train, y_train_array)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train_array)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test_array)))
Accuracy on training set: 0.828 Accuracy on test set: 0.743
Lets try to get an idea of how much each feature could perhaps matter, and how much of "importance" they are: Again found in: https://github.com/amueller/introduction_to_ml_with_python/blob/main/02-supervised-learning.ipynb chapter 2 - supervised learning - "Feature importance of trees"
def plot_feature_importances_fish(model, feature_names):
n_features_importances = len(model.feature_importances_)
print(f"Number of features in dataset: {len(feature_names)}")
print(f"Length of feature importances: {n_features_importances}")
plt.barh(np.arange(n_features_importances), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features_importances), feature_names[:n_features_importances]) # getting the features.
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features_importances)
feature_names = result_df_train.columns.tolist()
plot_feature_importances_fish(tree, feature_names)
Number of features in dataset: 22 Length of feature importances: 21
Now we see some interesting data from above, that perhaps decision trees dont actually capture the cyclical encoding that we do, since it seems to put more weight on Day_cos than Day_sin (two features), which is really both representing the day (one feature), but we will not consider this too much in our analysis and continue regarding it as working as it should, i recommend checking out: https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca for more info about this potential problem.
Let's now try to use things like Randomforests and Gradient boosting:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=10, random_state=0, max_depth=13)
forest.fit(X_train, y_train_array)
RandomForestRegressor(max_depth=13, n_estimators=10, random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestRegressor(max_depth=13, n_estimators=10, random_state=0)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train_array)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test_array)))
Accuracy on training set: 0.901 Accuracy on test set: 0.803
Pretty solid score for randomforest's, we can also try GBR below: Thanks to sklearn's MultioutputRegressor, thanks to amine! https://stackoverflow.com/questions/58113265/how-to-predict-multi-outputs-using-gradient-boosting-regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(random_state=0)
multi_output_gbrt = MultiOutputRegressor(gbrt)
multi_output_gbrt.fit(X_train, y_train_array)
train_r2_score = multi_output_gbrt.score(X_train, y_train_array)
print("Accuracy on training set: {:.3f}".format(train_r2_score))
test_r2_score = multi_output_gbrt.score(X_test, y_test_array)
print("Accuracy on test set: {:.3f}".format(test_r2_score))
Accuracy on training set: 0.768 Accuracy on test set: 0.721
We are going to create a neural network from scratch using this source, mainly: https://machinelearningmastery.com/develop-your-first-neural-network-with-pytorch-step-by-step/ keep in mind we are using PyTourch
X_train_numpy = X_train.to_numpy()
X_test_numpy = X_test.to_numpy()
X_train_tensor = torch.tensor(X_train_numpy, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_array, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_numpy, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_array, dtype=torch.float32)
X_train_tensor.shape
torch.Size([36800, 21])
y_train_tensor.shape
torch.Size([36800, 7])
class PimaClassifier(nn.Module):
def __init__(self):
super().__init__()
self.hidden1 = nn.Linear(21, 30)
self.act1 = nn.Tanh()
self.hidden2 = nn.Linear(30, 13)
self.act2 = nn.Tanh()
self.output = nn.Linear(13, 7)
#self.act_output = nn.Tanh()
def forward(self, x):
x = self.act1(self.hidden1(x))
x = self.act2(self.hidden2(x))
# x = self.act_output(self.output(x))
x = self.output(x)
return x
model = PimaClassifier()
print(model)
PimaClassifier( (hidden1): Linear(in_features=21, out_features=30, bias=True) (act1): Tanh() (hidden2): Linear(in_features=30, out_features=13, bias=True) (act2): Tanh() (output): Linear(in_features=13, out_features=7, bias=True) )
loss_fn = nn.MSELoss() #loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
n_epochs = 100
batch_size = 10
patience = 5 # this can be adjusted, dependent on how much "patience" one has to getting a better model or not.
best_val_loss = float('inf')
counter = 0
for epoch in range(n_epochs):
model.train() # setting to training mode.
epoch_loss = 0.0
for i in range(0, len(X_train_tensor), batch_size):
X_batch = X_train_tensor[i:i+batch_size]
y_pred = model(X_batch)
y_batch = y_train_tensor[i:i+batch_size]
loss = loss_fn(y_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item() * len(X_batch)
# compute (average) training loss for the epoch
epoch_loss /= len(X_train_tensor)
model.eval() # setting to validation mode.
with torch.no_grad():
y_val_pred = model(X_test_tensor)
val_loss = loss_fn(y_val_pred, y_test_tensor)
print(f"Epoch {epoch+1}/{n_epochs}, Training Loss: {epoch_loss:.4f}, Validation Loss: {val_loss:.4f}")
# checking if the val. is becoming better
if val_loss < best_val_loss:
best_val_loss = val_loss
counter = 0 # resetting the counter
else:
counter += 1
# early stopping: Stop training if validation loss does not improve for patience epochs
if counter >= patience:
print(f"Validation loss did not improve for {patience} epochs. Stopping training.")
break
# we see our model is very slowly getting better, which could perhaps mean there are better solutions to this problem.
Epoch 1/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 2/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 3/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 4/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 5/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 6/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 7/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 8/100, Training Loss: 0.0172, Validation Loss: 0.0243 Epoch 9/100, Training Loss: 0.0171, Validation Loss: 0.0243 Epoch 10/100, Training Loss: 0.0171, Validation Loss: 0.0243 Epoch 11/100, Training Loss: 0.0171, Validation Loss: 0.0243 Epoch 12/100, Training Loss: 0.0171, Validation Loss: 0.0243 Epoch 13/100, Training Loss: 0.0171, Validation Loss: 0.0242 Epoch 14/100, Training Loss: 0.0171, Validation Loss: 0.0242 Epoch 15/100, Training Loss: 0.0171, Validation Loss: 0.0242 Epoch 16/100, Training Loss: 0.0171, Validation Loss: 0.0242 Epoch 17/100, Training Loss: 0.0171, Validation Loss: 0.0242 Epoch 18/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 19/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 20/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 21/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 22/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 23/100, Training Loss: 0.0170, Validation Loss: 0.0242 Epoch 24/100, Training Loss: 0.0170, Validation Loss: 0.0241 Epoch 25/100, Training Loss: 0.0170, Validation Loss: 0.0241 Epoch 26/100, Training Loss: 0.0170, Validation Loss: 0.0241 Epoch 27/100, Training Loss: 0.0169, Validation Loss: 0.0241 Epoch 28/100, Training Loss: 0.0169, Validation Loss: 0.0241 Epoch 29/100, Training Loss: 0.0169, Validation Loss: 0.0241 Epoch 30/100, Training Loss: 0.0169, Validation Loss: 0.0241 Epoch 31/100, Training Loss: 0.0169, Validation Loss: 0.0241 Epoch 32/100, Training Loss: 0.0169, Validation Loss: 0.0240 Epoch 33/100, Training Loss: 0.0169, Validation Loss: 0.0240 Epoch 34/100, Training Loss: 0.0169, Validation Loss: 0.0240 Epoch 35/100, Training Loss: 0.0169, Validation Loss: 0.0240 Epoch 36/100, Training Loss: 0.0168, Validation Loss: 0.0240 Epoch 37/100, Training Loss: 0.0168, Validation Loss: 0.0240 Epoch 38/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 39/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 40/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 41/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 42/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 43/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 44/100, Training Loss: 0.0168, Validation Loss: 0.0239 Epoch 45/100, Training Loss: 0.0168, Validation Loss: 0.0238 Epoch 46/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 47/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 48/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 49/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 50/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 51/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 52/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 53/100, Training Loss: 0.0167, Validation Loss: 0.0238 Epoch 54/100, Training Loss: 0.0167, Validation Loss: 0.0237 Epoch 55/100, Training Loss: 0.0167, Validation Loss: 0.0237 Epoch 56/100, Training Loss: 0.0167, Validation Loss: 0.0237 Epoch 57/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 58/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 59/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 60/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 61/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 62/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 63/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 64/100, Training Loss: 0.0166, Validation Loss: 0.0237 Epoch 65/100, Training Loss: 0.0166, Validation Loss: 0.0236 Epoch 66/100, Training Loss: 0.0166, Validation Loss: 0.0236 Epoch 67/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 68/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 69/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 70/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 71/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 72/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 73/100, Training Loss: 0.0165, Validation Loss: 0.0236 Epoch 74/100, Training Loss: 0.0165, Validation Loss: 0.0235 Epoch 75/100, Training Loss: 0.0165, Validation Loss: 0.0235 Epoch 76/100, Training Loss: 0.0165, Validation Loss: 0.0235 Epoch 77/100, Training Loss: 0.0165, Validation Loss: 0.0235 Epoch 78/100, Training Loss: 0.0165, Validation Loss: 0.0235 Epoch 79/100, Training Loss: 0.0164, Validation Loss: 0.0235 Epoch 80/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 81/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 82/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 83/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 84/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 85/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 86/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 87/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 88/100, Training Loss: 0.0164, Validation Loss: 0.0234 Epoch 89/100, Training Loss: 0.0164, Validation Loss: 0.0233 Epoch 90/100, Training Loss: 0.0164, Validation Loss: 0.0233 Epoch 91/100, Training Loss: 0.0164, Validation Loss: 0.0233 Epoch 92/100, Training Loss: 0.0164, Validation Loss: 0.0233 Epoch 93/100, Training Loss: 0.0163, Validation Loss: 0.0233 Epoch 94/100, Training Loss: 0.0163, Validation Loss: 0.0233 Epoch 95/100, Training Loss: 0.0163, Validation Loss: 0.0233 Epoch 96/100, Training Loss: 0.0163, Validation Loss: 0.0232 Epoch 97/100, Training Loss: 0.0163, Validation Loss: 0.0232 Epoch 98/100, Training Loss: 0.0163, Validation Loss: 0.0232 Epoch 99/100, Training Loss: 0.0163, Validation Loss: 0.0232 Epoch 100/100, Training Loss: 0.0163, Validation Loss: 0.0232
# compute accuracy (no_grad is optional)
with torch.no_grad():
y_pred = model(X_test_tensor)
accuracy = (y_pred.round() == y_test_tensor).float().mean()
print(f"Accuracy {accuracy}")
Accuracy 0.6388562917709351
We can take the following data from the start of the preprocessing phase, since we dont need target or test data either way:
combined_df
| Melding ID | latitude | longitude | main_species | vessel_ratio(height/width) | start_date | time_duration | total_weight | times | tools_used | species_weights_list | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | Other | 4.459821 | 2018-01-01 | 101 | 871.0 | 01:19 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 871.0] |
| 1 | 1497288 | 69.744 | 16.516 | Sei | 3.054444 | 2018-01-01 | 881 | 5304.0 | 05:47 | Udefinert garn | [2100.0, 2895.0, 54.0, 95.0, 0.0, 0.0, 16.0] |
| 2 | 1497306 | 72.866 | 29.105 | Torsk | 4.658000 | 2018-01-01 | 900 | 11321.0 | 07:00 | Andre liner | [8371.0, 0.0, 2257.0, 0.0, 0.0, 0.0, 660.0] |
| 3 | 1497310 | 58.636 | 0.876 | Lange | 3.467143 | 2018-01-01 | 249 | 2994.0 | 07:09 | Dobbeltrål | [188.0, 480.0, 0.0, 1392.0, 0.0, 0.0, 874.0] |
| 4 | 1497311 | 73.127 | 28.324 | Torsk | 4.014286 | 2018-01-01 | 87 | 4131.0 | 17:09 | Bunntrål | [3850.0, 0.0, 202.0, 0.0, 0.0, 0.0, 79.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 1800240 | 74.650 | 36.783 | Other | 4.516129 | 2018-12-31 | 0 | 1774.0 | 22:17 | Teiner | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1774.0] |
| 46044 | 1800245 | 57.774 | 5.861 | Sei | 3.261905 | 2018-12-31 | 364 | 1754.0 | 06:56 | Bunntrål | [71.0, 1062.0, 0.0, 36.0, 0.0, 0.0, 81.0] |
| 46045 | 1800252 | 71.317 | 24.700 | Hyse | 5.255814 | 2018-12-30 | 420 | 5228.0 | 23:00 | Andre liner | [1485.0, 0.0, 2633.0, 0.0, 0.0, 0.0, 1110.0] |
| 46046 | 1800263 | 75.352 | 14.944 | Hyse | 4.654545 | 2018-12-31 | 0 | 12307.0 | 23:26 | Andre liner | [4502.0, 0.0, 5295.0, 0.0, 0.0, 0.0, 2322.0] |
| 46047 | 1800268 | 74.957 | 16.174 | Torsk | 4.014286 | 2018-12-30 | 315 | 36879.0 | 22:50 | Bunntrål | [24090.0, 68.0, 11155.0, 0.0, 0.0, 0.0, 1170.0] |
46048 rows × 11 columns
type(combined_df)
pandas.core.frame.DataFrame
combined_df.drop("species_weights_list", axis=1, inplace=True) # we can remove our target feature, since it dosent give us much here.
One hot encoding our main_species and tools_used, but we apply PCA on it later, so that we dont get too many dimentions.
combined_df = pd.concat([combined_df, pd.get_dummies(combined_df["main_species"], prefix="species").astype(int)], axis=1)
combied_df = combined_df.drop("main_species", axis=1, inplace=True)
combined_df
| Melding ID | latitude | longitude | vessel_ratio(height/width) | start_date | time_duration | total_weight | times | tools_used | species_Blåkveite | species_Breiflabb | species_Brosme | species_Dypvannsreke | species_Hyse | species_Lange | species_Lysing | species_Other | species_Sei | species_Torsk | species_Uer (vanlig) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | 2018-01-01 | 101 | 871.0 | 01:19 | Teiner | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 2018-01-01 | 881 | 5304.0 | 05:47 | Udefinert garn | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 2018-01-01 | 900 | 11321.0 | 07:00 | Andre liner | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | 2018-01-01 | 249 | 2994.0 | 07:09 | Dobbeltrål | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | 2018-01-01 | 87 | 4131.0 | 17:09 | Bunntrål | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 1800240 | 74.650 | 36.783 | 4.516129 | 2018-12-31 | 0 | 1774.0 | 22:17 | Teiner | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46044 | 1800245 | 57.774 | 5.861 | 3.261905 | 2018-12-31 | 364 | 1754.0 | 06:56 | Bunntrål | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 46045 | 1800252 | 71.317 | 24.700 | 5.255814 | 2018-12-30 | 420 | 5228.0 | 23:00 | Andre liner | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46046 | 1800263 | 75.352 | 14.944 | 4.654545 | 2018-12-31 | 0 | 12307.0 | 23:26 | Andre liner | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46047 | 1800268 | 74.957 | 16.174 | 4.014286 | 2018-12-30 | 315 | 36879.0 | 22:50 | Bunntrål | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
46048 rows × 20 columns
combined_df = pd.concat([combined_df, pd.get_dummies(combined_df["tools_used"], prefix="tools").astype(int)], axis=1)
combied_df = combined_df.drop("tools_used", axis=1, inplace=True)
combined_df
| Melding ID | latitude | longitude | vessel_ratio(height/width) | start_date | time_duration | total_weight | times | species_Blåkveite | species_Breiflabb | ... | tools_Bunntrål | tools_Bunntrål par | tools_Dobbeltrål | tools_Other | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497249 | 74.811 | 36.665 | 4.459821 | 2018-01-01 | 101 | 871.0 | 01:19 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1497288 | 69.744 | 16.516 | 3.054444 | 2018-01-01 | 881 | 5304.0 | 05:47 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1497306 | 72.866 | 29.105 | 4.658000 | 2018-01-01 | 900 | 11321.0 | 07:00 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1497310 | 58.636 | 0.876 | 3.467143 | 2018-01-01 | 249 | 2994.0 | 07:09 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1497311 | 73.127 | 28.324 | 4.014286 | 2018-01-01 | 87 | 4131.0 | 17:09 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 1800240 | 74.650 | 36.783 | 4.516129 | 2018-12-31 | 0 | 1774.0 | 22:17 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 46044 | 1800245 | 57.774 | 5.861 | 3.261905 | 2018-12-31 | 364 | 1754.0 | 06:56 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46045 | 1800252 | 71.317 | 24.700 | 5.255814 | 2018-12-30 | 420 | 5228.0 | 23:00 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46046 | 1800263 | 75.352 | 14.944 | 4.654545 | 2018-12-31 | 0 | 12307.0 | 23:26 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46047 | 1800268 | 74.957 | 16.174 | 4.014286 | 2018-12-30 | 315 | 36879.0 | 22:50 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
46048 rows × 30 columns
See chapter 1.2 about decision tree, to see feature importance, we will remove the least important, things like times, and time_duration:
combied_df = combined_df.drop("times", axis=1, inplace=True)
combied_df = combined_df.drop("time_duration", axis=1, inplace=True)
combied_df = combined_df.drop("Melding ID", axis=1, inplace=True) # removing Melding ID, dont need it
combined_df
| latitude | longitude | vessel_ratio(height/width) | start_date | total_weight | species_Blåkveite | species_Breiflabb | species_Brosme | species_Dypvannsreke | species_Hyse | ... | tools_Bunntrål | tools_Bunntrål par | tools_Dobbeltrål | tools_Other | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74.811 | 36.665 | 4.459821 | 2018-01-01 | 871.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 69.744 | 16.516 | 3.054444 | 2018-01-01 | 5304.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 72.866 | 29.105 | 4.658000 | 2018-01-01 | 11321.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 58.636 | 0.876 | 3.467143 | 2018-01-01 | 2994.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 73.127 | 28.324 | 4.014286 | 2018-01-01 | 4131.0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 74.650 | 36.783 | 4.516129 | 2018-12-31 | 1774.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 46044 | 57.774 | 5.861 | 3.261905 | 2018-12-31 | 1754.0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46045 | 71.317 | 24.700 | 5.255814 | 2018-12-30 | 5228.0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46046 | 75.352 | 14.944 | 4.654545 | 2018-12-31 | 12307.0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46047 | 74.957 | 16.174 | 4.014286 | 2018-12-30 | 36879.0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
46048 rows × 27 columns
new_date_df_combined = transform_from_date_to_cyclical(combined_df["start_date"])
date_cyclical_combined = cyclical.fit_transform(new_date_df_combined[["Day", "Month"]]) #selecting dataframe!
combined_df.reset_index(drop=True, inplace=True) # avoiding potential issues when concatination is happening...
date_cyclical_combined.reset_index(drop=True, inplace=True)
combined_df = pd.concat([combined_df, date_cyclical_combined[['Day_sin', 'Day_cos', 'Month_sin', 'Month_cos']]], axis=1) # adding to original dataframe
combined_df.drop("start_date", axis=1, inplace=True) # we can remove start_date as we now have a encoded version.
combined_df
| latitude | longitude | vessel_ratio(height/width) | total_weight | species_Blåkveite | species_Breiflabb | species_Brosme | species_Dypvannsreke | species_Hyse | species_Lange | ... | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74.811 | 36.665 | 4.459821 | 871.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 1 | 69.744 | 16.516 | 3.054444 | 5304.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 2 | 72.866 | 29.105 | 4.658000 | 11321.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 3 | 58.636 | 0.876 | 3.467143 | 2994.0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 4 | 73.127 | 28.324 | 4.014286 | 4131.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 74.650 | 36.783 | 4.516129 | 1774.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46044 | 57.774 | 5.861 | 3.261905 | 1754.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46045 | 71.317 | 24.700 | 5.255814 | 5228.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
| 46046 | 75.352 | 14.944 | 4.654545 | 12307.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46047 | 74.957 | 16.174 | 4.014286 | 36879.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
46048 rows × 30 columns
combined_df['vessel_ratio(height/width)'] = minmax_scaler.fit_transform(combined_df[['vessel_ratio(height/width)']])
combined_df['latitude'] = standard_scaler.fit_transform(combined_df[['latitude']])
combined_df['longitude'] = standard_scaler.fit_transform(combined_df[['longitude']])
Robustscaler is here used, since we dont want that outliers have such a big influene, and in total_weight we know there are some outliers (some very large and some very small values).
combined_df['total_weight'] = robust_scaler.fit_transform(combined_df[['total_weight']])
combined_df
| latitude | longitude | vessel_ratio(height/width) | total_weight | species_Blåkveite | species_Breiflabb | species_Brosme | species_Dypvannsreke | species_Hyse | species_Lange | ... | tools_Reketrål | tools_Snurpenot/ringnot | tools_Snurrevad | tools_Teiner | tools_Udefinert garn | tools_Udefinert trål | Day_sin | Day_cos | Month_sin | Month_cos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.309803 | 1.627054 | 0.460517 | -0.352687 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 1 | 0.513702 | 0.173175 | 0.174123 | -0.187695 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 2 | 1.004214 | 1.081552 | 0.500902 | 0.036251 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 3 | -1.231530 | -0.955351 | 0.258224 | -0.273671 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| 4 | 1.045221 | 1.025198 | 0.369723 | -0.231353 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2.012985e-01 | 0.97953 | 5.000000e-01 | 0.866025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46043 | 1.284507 | 1.635568 | 0.471991 | -0.319078 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46044 | -1.366963 | -0.595651 | 0.216400 | -0.319823 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46045 | 0.760843 | 0.763703 | 0.622728 | -0.190524 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
| 46046 | 1.394802 | 0.059745 | 0.500198 | 0.072949 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.449294e-16 | 1.00000 | -2.449294e-16 | 1.000000 |
| 46047 | 1.332741 | 0.148498 | 0.369723 | 0.987494 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -2.012985e-01 | 0.97953 | -2.449294e-16 | 1.000000 |
46048 rows × 30 columns
Now we have a full dataset, with all encoded / scaled values, lets go into some unsupervised learning:
We will use t-SNE since our data is high dimentional, with perhaps some outliers, since we know some boats have a huge difference between their total weights.
tsne = TSNE(n_components=2, random_state=0)
tsne_data = tsne.fit_transform(combined_df)
plt.figure(figsize=(8, 6))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], alpha=0.5)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Plot')
#plt.grid(True)
plt.show()
We can try DBSCAN, since we see that density could help our case here:
dbscan = DBSCAN(eps=1, min_samples=8)
dbscan.fit(tsne_data)
# cluster labels
cluster_labels = dbscan.labels_
# cluster labels, ignoring noise.
n_clusters_ = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise_ = list(cluster_labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
Estimated number of clusters: 1187 Estimated number of noise points: 2905
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=cluster_labels, cmap='viridis', s=10, alpha=0.5)
plt.title('DBSCAN Clustering on t-SNE-transformed Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label')
plt.show()
Evaluation of unsupervised learning, without ground truth: see: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c for more
silhouette_avg = silhouette_score(tsne_data, cluster_labels)
print("The average silhouette_score is :", silhouette_avg)
The average silhouette_score is : 0.3204624
cluster_labels
array([ 0, 1, 2, ..., 5, 869, 1115], dtype=int64)
Some inspiration from https://medium.com/@tarammullin/visualizing-dbscan-results-with-t-sne-plotly-e3742205c900, while also using logic from previous code, to show our "unsupervised results" in a more useful maner.
# creating a df with our t-SNE components
df = pd.DataFrame(tsne_data, columns=['t-SNE Component 1', 't-SNE Component 2'])
df['Cluster Label'] = cluster_labels
# check link, using plotly to make a figure that will be interactive.
fig = px.scatter(df, x='t-SNE Component 1', y='t-SNE Component 2', color='Cluster Label',
title='DBSCAN Clustering on t-SNE-transformed Data', opacity=0.5,
hover_name=df.index, size_max=10)
# showing the figure.
fig.update_layout(
xaxis_title='t-SNE Component 1',
yaxis_title='t-SNE Component 2',
coloraxis_colorbar=dict(title='Cluster Label'),
height=800
)
fig.show()
Lets start off by saying this project was a rollercoster of ups and downs though at the end i had a lot of fun making it. Our problem was that we wanted to predict from some data, the most common species (target feature), in like a list with percentages summing to 100%, see more in our discussion section above, this was clearly a regression problem and would not be such an easy problem to try to predict. We can start off by trying to acknowledge the results we got from Chapter 2 - Supervised Learning, where we start off with our KNN algorithm which actually works quite well for this problem case. Though i already knew it would not catch all the complex relationships, it still did beyong my expectations. Then we try deciosion trees, this is not only because decision trees ususally are a good algortihm for a bunch of problems, but also since we had a potential problem when it came to our cyclical encoding which i wanted to test, see Section about decision trees (random forests - more specifically). It still did quite well, obviously most likely overfitted to some degree, but we also knew that because of that potential problem decision trees would perhaps not be the most optimal algorithm. Now onto the last of them, deep learning, or neural networks, it took quite a while trying to tune it to do it, but at the end it did end up doing good, perhaps it is a local minimum (and it most likely is because i had limited time), but it seemed to grasp better the more complex relationships at least to some degree. Our unsupervised learning algorithm approach went from t-SNE which actually worked really well, since PCA did a horrible job, which i scratched off the project since it did not show anything informative, working from t-SNE we continued with trying to cluster this brain-like figure, with DBSCAN, since it seemed to be a density problem from the visualization, its quite hard to do so, and perhaps DBSCAN is not the most optimal, but it worked to some degree, also showing that there is quite a lot of different spanning-data here. This we can tell since looking at the representations of both the t-SNE and DBSCAN we see some dots at left side, a kind of valley to the top-right and most of the other data seems to just be a bunch of clusters, showing off that some data is quite different than the other, but still has some meaningfull realationships - in the sense that it seems to be multiple points of this data - not only one singular, which DBSCAN is awesome at visualizing. The silhoutte score is somewhat okay, since i think it will be quite hard to be able to cluster it well, since we discussed that the data seems to have big differenting sides at some points at least - like the examples i wrote a few sentences back.
The biggest issue i had along the project was the data, what should we use as training data and test data? I started off by just a random train_test_split at the start and some of my algorithms did not even go over a 0.2 R2 score, while my neural net did a -0.7 R2 score on testing data, while still doing quite well on the training data, what went wrong? I actually figured that since we have data spanning from the whole year (1 - 12 months), and since there is just a random split at the start, our training data could perhaps just be "exposed" (or train) on data that is in some specific months, but not all perhaps, or not enough on others, this could lead to a big inbalance, and when it came to the test data it had nothing or barely anything to go off of, so it would be bingo. This introduced me to a new problem i had to face, how should we split the data so it will generalize well across all the months of the year? (see section Splitting the data for more). We chose to train_test split in every month, so that we would get a lot of training and some test data to predict on for all months, which would be better for the training part (better to generalize for each given month), and for the testing part (testing on every month, since the results could be quite different from month to month), this would end up being a viable approach.
# End project.